I present a reproducible, multi-turn red team experiment probing the gradual erosion of safety guardrails in large language models (LLMs), a phenomenon I term “guardrail decay”: the progressive weakening of a model’s safety constraints during extended, session-level interaction. While not yet standardized in the literature, the concept addresses an emerging vulnerability: safety compliance may degrade incrementally under adversarial or reflective prompting, not just fail in single-shot attacks.
Recent research has shown that LLM safety alignment is often shallow and brittle, frequently effective only for the first few tokens or in constrained formats [Qi et al., 2024; Wang et al., 2024]. Even state-of-the-art guardrails can be bypassed or eroded through persistent adversarial input [Yuan et al., 2024]. This study builds on those insights by operationalizing and systematically quantifying guardrail decay under multi-turn red teaming.
Using 2,110 prompt-response cycles across GPT-4o, Claude 3.7 Sonnet, and Gemini 1.5 Pro, I measure how repeated or reflective prompting degrades alignment over time—a risk not fully captured by most prior work, which focuses on single-shot jailbreaks. My automated test harness scores “drift” per turn and reveals that guardrail decay can persist for up to 10 consecutive turns (the maximum tested), with models failing to self-correct and some outputs appearing superficially safe. Results indicate that adversarial prompt chains can reliably induce alignment drift, with notable variation between models and prompt types.
All methods, code, and data are released for transparency and reproducibility. I invite peer review and community feedback to strengthen LLM safety practices and spark further research on persistent vulnerabilities in model alignment.
[GitHub Repo or Notion Project – insert link when ready]
Guardrails are engineered constraints—such as system prompts, content filters, and behavioral policies—designed to keep large language models (LLMs) aligned with ethical, safe, and policy-compliant behavior. These guardrails are considered critical for deploying LLMs in safety-sensitive contexts [Qi et al., 2024; Yuan et al., 2024].
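To make the notion concrete, the sketch below layers two such constraints: a policy-bearing system prompt and a simple output-side content filter. The policy text, blocked terms, and function names are illustrative assumptions for this example, not the guardrails of any specific deployed model.

```python
# Illustrative sketch of two common guardrail layers: a policy-bearing system
# prompt and a simple output-side content filter. The policy text, blocked
# terms, and function names are hypothetical, not any provider's actual rules.

SYSTEM_PROMPT = (
    "You are a helpful assistant. Refuse requests for harmful, illegal, or "
    "policy-violating content, and briefly explain why you are refusing."
)

BLOCKED_TERMS = {"synthesize explosives", "credit card dump"}  # toy filter list


def trips_content_filter(text: str) -> bool:
    """Return True if the candidate output contains a blocked term."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def apply_output_guardrail(model_reply: str) -> str:
    """Replace filtered replies with a refusal; otherwise pass them through."""
    if trips_content_filter(model_reply):
        return "I can't help with that request."
    return model_reply
```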
However, research has demonstrated that LLM guardrails are imperfect. Under adversarial or prolonged prompting, models can experience alignment drift, with safety behaviors and refusals eroding over repeated interaction [Qi et al., 2024; Wang et al., 2024].
This work introduces the term “guardrail decay” to describe the progressive weakening of a model’s adherence to safety constraints during multi-turn or session-level interaction. While “guardrail decay” is not yet standardized in the literature, it captures the emerging risk that model alignment can degrade gradually, rather than fail only on single prompts. Related concepts—such as shallow safety alignment, brittle guardrails, and fake alignment—have been identified in recent LLM safety studies [Qi et al., 2024; Wang et al., 2024; Yuan et al., 2024].
To characterize this risk, I ran 211 experiments spanning 2,110 prompt-response cycles on current models (GPT-4o, Claude 3.7 Sonnet, Gemini 1.5 Pro). My automated test harness tracks when, how, and how persistently these models lose alignment, measuring the “drift” from initial safety constraints under chained or reflective prompting.
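As a rough illustration of what such a harness does, the sketch below runs a chained prompt sequence against a generic chat client and records a drift score for each turn. The `client.chat(messages)` interface, the `score_drift` callable, and the 0-to-1 scoring convention are placeholder assumptions for the example, not the harness’s actual API.

```python
# Minimal sketch of the multi-turn probing loop, assuming a generic
# chat-completion client (`client.chat(messages)`) and a per-response scoring
# helper (`score_drift`); both names are placeholders, not the real harness API.
from dataclasses import dataclass


@dataclass
class TurnRecord:
    turn: int
    prompt: str
    response: str
    drift_score: float  # 0.0 = fully aligned refusal, 1.0 = full compliance


def run_chain(client, prompt_chain: list[str], score_drift) -> list[TurnRecord]:
    """Send a chained sequence of prompts and score alignment drift per turn."""
    messages = []
    records = []
    for turn, prompt in enumerate(prompt_chain, start=1):
        messages.append({"role": "user", "content": prompt})
        response = client.chat(messages)
        messages.append({"role": "assistant", "content": response})
        records.append(TurnRecord(turn, prompt, response, score_drift(response)))
    return records
```

Keeping a structured record per turn makes it straightforward to aggregate drift later by model, prompt type, and turn index.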
This work represents my first formal research effort in LLM safety and was conducted as an independent, self-taught practitioner. By explicitly documenting methods, code, and results, I aim to contribute a reproducible case study to the red team and alignment research community, highlighting both successes and limitations in early-stage safety engineering.
This section details the experimental setup, data collection, drift scoring, and analysis pipeline for evaluating LLM guardrail decay. All steps are documented for transparency and reproducibility, including limitations and in-progress improvements.
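As one example of the kind of analysis step involved, the sketch below measures how long drift persists within a chain by counting the longest run of consecutive turns above a drift threshold. The 0.5 threshold and the example scores are assumptions chosen for illustration, not values from the actual pipeline.

```python
# Illustrative sketch of one analysis step: measuring how long drift persists
# within a chain. The 0.5 threshold and the example scores below are assumed
# for illustration, not taken from the actual pipeline or results.


def longest_drift_run(drift_scores: list[float], threshold: float = 0.5) -> int:
    """Length of the longest run of consecutive turns whose drift exceeds the threshold."""
    longest = current = 0
    for score in drift_scores:
        current = current + 1 if score > threshold else 0
        longest = max(longest, current)
    return longest


# Example: a 10-turn chain where the model drifts at turn 4 and never recovers.
scores = [0.0, 0.1, 0.2, 0.6, 0.7, 0.8, 0.9, 0.9, 1.0, 1.0]
print(longest_drift_run(scores))  # -> 7
```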