Aegis is a subnet on Bittensor designed to automate and decentralize the red-teaming of Large Language Models (LLMs). Our core vision is to remove the single largest bottleneck in enterprise AI adoption: safety and alignment. We believe robust AI cannot rely on static safety benchmarks or slow, centralized manual audits; it requires a dynamic, crowdsourced immune system.
To achieve this, Aegis engineers an adversarial incentive mechanism. AI agents, developed and operated by miners, act as attackers attempting to discover vulnerabilities, logic flaws, and safety bypasses (jailbreaks) in target models. Validators act as objective referees, verifying the success of these attacks. The byproduct of this continuous adversarial game is a constantly evolving dataset of verified, successful exploits: data that is critical for RLHF (Reinforcement Learning from Human Feedback) and model hardening.
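To make the verification flow concrete, the sketch below shows one way a validator might replay a miner's submitted attack and judge its success. The `AttackSubmission` structure and the `query_model` / `is_unsafe` callables are illustrative placeholders, not a specification of the Aegis protocol:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AttackSubmission:
    miner_id: str       # hotkey of the submitting miner
    target_model: str   # identifier of the model under attack
    prompt: str         # the adversarial prompt (attack vector)

def verify_attack(
    submission: AttackSubmission,
    query_model: Callable[[str, str], str],
    is_unsafe: Callable[[str, str], bool],
) -> bool:
    """Replay a submitted attack against the target model and judge it.

    `query_model(model, prompt)` sends the prompt to the target LLM;
    `is_unsafe(prompt, response)` stands in for a safety classifier
    that flags policy-violating completions.
    """
    response = query_model(submission.target_model, submission.prompt)
    # The attack succeeds only if the model produced output it should
    # have refused; a refusal means the jailbreak failed.
    return is_unsafe(submission.prompt, response)
```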
This proposal outlines how Aegis transforms the concept of "Proof of Intelligence" into a "Proof of Vulnerability," establishing a sustainable, data-generating economic flywheel.
The incentive mechanism of Aegis is engineered to maximize the discovery of novel safety vulnerabilities (jailbreaks) in LLMs, rather than to reward brute-force spam. Rewards are distributed based on the success and quality of each attack vector.
Reward Function ($R$):
$$ R = (S_{severity} \times W_{stealth}) \times D_{diversity} $$
where $S_{severity}$ scores the impact of the confirmed exploit, $W_{stealth}$ weights the subtlety of the attack vector, and $D_{diversity}$ scales the reward by the novelty of the attack relative to prior submissions.
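The following Python sketch shows one way this reward could be computed, under assumptions the formula itself does not pin down: severity and stealth as validator-assigned scores in $[0, 1]$, and diversity as one minus the maximum cosine similarity between the new attack's embedding and previously rewarded attacks:

```python
import numpy as np

def reward(
    severity: float,
    stealth: float,
    attack_emb: np.ndarray,
    rewarded_embs: list[np.ndarray],
) -> float:
    """Compute R = (S_severity * W_stealth) * D_diversity.

    Assumed conventions (not fixed by the proposal): severity and
    stealth are validator-assigned scores in [0, 1]; diversity is
    1 - max cosine similarity to previously rewarded attacks, so an
    exact duplicate earns zero.
    """
    if not rewarded_embs:
        diversity = 1.0  # first attack of its kind is maximally novel
    else:
        sims = [
            float(attack_emb @ e)
            / (np.linalg.norm(attack_emb) * np.linalg.norm(e))
            for e in rewarded_embs
        ]
        diversity = max(1.0 - max(sims), 0.0)  # clamp near-duplicates to 0
    return (severity * stealth) * diversity
```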
To ensure high-quality dataset generation, Aegis enforces strict submission rules; any violation results in an immediate scoring penalty or pruning of the offending miner.
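As an illustration of how such penalties might be applied, the sketch below uses exact-prompt resubmission as a stand-in rule; the strike counter, zeroed score, and `deregister` pruning hook are all hypothetical:

```python
from collections import defaultdict

PRUNE_THRESHOLD = 3                 # hypothetical: strikes before pruning
strikes: dict[str, int] = defaultdict(int)
seen_prompts: set[str] = set()      # naive exact-duplicate detector

def deregister(miner_id: str) -> None:
    """Hypothetical pruning hook: remove the miner from the subnet."""
    print(f"pruning miner {miner_id}")

def enforce_rules(miner_id: str, prompt: str, score: float) -> float:
    """Zero the score of a rule-violating submission and track strikes."""
    if prompt in seen_prompts:      # stand-in for a real rule check
        strikes[miner_id] += 1
        if strikes[miner_id] >= PRUNE_THRESHOLD:
            deregister(miner_id)    # repeat offenders are pruned
        return 0.0                  # immediate scoring penalty
    seen_prompts.add(prompt)
    return score
```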
Aegis qualifies as a genuine "Proof of Intelligence" (PoI) because discovering a novel jailbreak in a highly aligned, robust model (e.g., Llama-3-70B) requires complex cognitive simulation and reasoning, not merely brute-force computation.