Do Language Models Share Unsafe Directions in Activation Space?

Research sketch / blog note — testing whether a single safety axis emerges across models after aligning their activations into a shared space.

Executive Summary

Conceptual illustration of safe and unsafe activation vectors and their difference as a direction in activation space.
Figure 1: Safe and unsafe prompt means define a direction that points toward unsafe activations.

The work asks whether safety-related representations line up across models. Safety vectors are extracted per model, aligned into a reference activation space, and combined to reveal a dominant “universal” safety direction.

After alignment, the universal vector closely matches native safety vectors (cosine similarity ≈0.72 on average) and separates safe vs. unsafe activations with AUROC comparable to or better than model-specific directions, especially for smaller models.

The approach can score relative safety across models and shows a coherent generation-time signal, but it does not yield a universal decision threshold and can break when alignment regimes differ.

High-Level Takeaways

Detailed Methodology and Analysis

2.1 Identifying individual safety vectors

Using 600 prompts from SalKhan12/prompt-safety-dataset (balanced safe/unsafe), a safety direction is built per model by averaging hidden states over each prompt's final five tokens, then subtracting the safe-prompt mean from the unsafe-prompt mean to obtain an unsafe direction.
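The mean-difference construction can be sketched as follows, assuming hidden states have already been extracted into arrays of shape (num_prompts, seq_len, d_model). The function name and the `last_k` parameter are illustrative, not from the post:

```python
import numpy as np

def safety_direction(hidden_safe, hidden_unsafe, last_k=5):
    """Per-model unsafe direction: mean unsafe activation minus mean
    safe activation, each averaged over the final `last_k` tokens."""
    safe_mean = hidden_safe[:, -last_k:, :].mean(axis=(0, 1))
    unsafe_mean = hidden_unsafe[:, -last_k:, :].mean(axis=(0, 1))
    v = unsafe_mean - safe_mean
    # Unit-normalize so dot products act directly as projection scores.
    return v / np.linalg.norm(v)
```

Projecting an activation onto this unit vector gives the per-prompt scores of the kind plotted in Figures 2 and 3.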

Projection score curves for Pythia models, showing safe vs. unsafe prompt distributions.
Figure 2: Pythia projection scores onto native unsafe directions (safe in blue, unsafe in red).
Projection histograms for LLaMA2-13B-Chat and LLaMA3-8B-Instruct.
Figure 3: Larger instruction-tuned models also show separation between safe and unsafe activations.

Smaller models reveal noisier separation, motivating a shared direction to stabilize safety cues. Including both LLaMA 2 and LLaMA 3 tests whether the idea holds across alignment strategies as well as scale.

2.2 Identifying alignment matrices

For each family, a reference model is chosen (Pythia-1B for smaller models, LLaMA3-8B-Instruct for larger ones). A linear alignment matrix, learned with ridge regression on 3,000 Alpaca instructions, maps each model’s activations into its reference space.

Alignment setup showing mappings from model activations into a shared reference space.
Figure 4: Alignment setup for projecting model activations into a shared reference space.
Aligned benign activation visualizations across Pythia and instruction-tuned models.
Figure 5: Aligned benign activations preserve structure when mapped into the reference models.
Plot showing ridge regularization effect on alignment error.
Figure 6: Moderate ridge regularization (λ = 1.0) minimizes alignment error across models.

Alignment uses only benign data; safety labels are not involved. Each model ends up with a matrix that projects its safety vector into the reference space.
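A minimal version of the ridge fit, assuming paired activations from a model and its reference on the same benign instructions. Names are illustrative, and any preprocessing (e.g. centering) is an assumption:

```python
import numpy as np

def fit_alignment(X, Y, lam=1.0):
    """Closed-form ridge regression for a linear map W with X @ W ≈ Y.
    X: (n, d_model) activations of the model on benign prompts.
    Y: (n, d_ref) activations of the reference model on the same prompts.
    lam defaults to the λ = 1.0 that minimized alignment error (Figure 6)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
```

Because the fit sees only benign data, W encodes a generic change of basis; a model's safety vector is then carried into the reference space as `v @ W`.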

2.3 Safety vectors represented together

Projected safety vectors are collected in the reference space and decomposed with SVD to obtain a universal safety vector for each model group.
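One way to read "decomposed with SVD": stack the aligned, unit-normalized safety vectors and take the first right singular vector as the group's universal direction. The sign convention below is an assumption:

```python
import numpy as np

def universal_direction(aligned_vectors):
    """First right singular vector of the stacked aligned safety vectors:
    the single axis capturing the most shared variance across models."""
    M = np.stack([v / np.linalg.norm(v) for v in aligned_vectors])
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    u = Vt[0]
    # Fix the sign so the axis points toward "unsafe", i.e. correlates
    # positively with the per-model vectors on average.
    return u if u @ M.mean(axis=0) >= 0 else -u
```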

Cosine similarity of aligned safety vectors across small and large model groups.
Figure 7: Cosine similarity across aligned safety vectors for small (left) and large (right) models.

2.4 Separating safe vs. unsafe activations using the universal axis

AUROC is measured by projecting activations onto each model’s native direction and onto the universal direction.

Small models (reference: Pythia-1B)

Model                 AUROC (native)   AUROC (universal)
Pythia-160M           0.7109           0.6984
Pythia-410M           0.6600           0.7376
Pythia-1B             0.7971           0.7714

Large models (reference: LLaMA3-8B-Instruct)

Model                 AUROC (native)   AUROC (universal)
LLaMA3-8B-Instruct    0.8607           0.7969
Mistral-7B-Instruct   0.6758           0.7911
Qwen2.5-7B-Instruct   0.8655           0.8190
LLaMA2-13B-Chat       0.8508           0.2347

Universal projections stay competitive with native directions overall, with clear gains for Pythia-410M (0.66 → 0.74) and Mistral-7B-Instruct (0.68 → 0.79). LLaMA2-13B-Chat is the counterexample: despite high cosine similarity after alignment, its different alignment strategy destroys semantic separation (0.85 → 0.23).

No single decision threshold transfers across models; calibration remains model-specific, making the universal axis better for relative comparisons or controlled analysis than for absolute classification.
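The AUROC comparison can be reproduced with a rank-based estimator, assuming per-prompt projection scores are already computed (sklearn's `roc_auc_score` would be equivalent; this numpy version keeps the sketch dependency-free):

```python
import numpy as np

def auroc(unsafe_scores, safe_scores):
    """Probability that a random unsafe score exceeds a random safe
    score, counting ties as one half (the Mann-Whitney U / (n*m) form)."""
    u = np.asarray(unsafe_scores)[:, None]
    s = np.asarray(safe_scores)[None, :]
    return float((u > s).mean() + 0.5 * (u == s).mean())
```

As a reading aid: an AUROC far below 0.5, as for LLaMA2-13B-Chat under the universal axis, means the projection ranks safe prompts above unsafe ones, i.e. the aligned direction is effectively inverted for that model.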

Preliminary implications: relative safety comparison, routing, and control

Projecting responses from multiple models onto the shared axis enables routing based on relative safety scores rather than model-specific thresholds.
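A routing sketch under the assumptions above: each model's response activation is mapped through its own alignment matrix, scored on the shared axis, and the lowest score wins. All names here are hypothetical:

```python
import numpy as np

def route_safest(acts_by_model, align_by_model, u):
    """Relative-safety routing: project each model's response activation
    into the reference space and onto the universal axis u (higher =
    more unsafe), then pick the model with the lowest score. Scores are
    only compared to each other -- no absolute threshold is assumed."""
    scores = {name: float(act @ align_by_model[name] @ u)
              for name, act in acts_by_model.items()}
    return min(scores, key=scores.get), scores
```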

Diagram of routing across models using a shared safety score.
Figure 8: Conceptual routing across models using a shared safety representation.

Generation-time signal

Tracking projection scores during decoding shows unsafe generations drifting upward while safe ones stay lower, hinting at a usable control signal during generation.
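The generation-time signal amounts to scoring each decoding step's hidden state on the shared axis; a minimal sketch, with array shapes and names as assumptions:

```python
import numpy as np

def score_trajectory(step_hiddens, align_mat, u):
    """Universal safety score at each generated token.
    step_hiddens: (num_steps, d_model) hidden states collected during
    decoding; returns a (num_steps,) trajectory whose upward drift
    flags a generation turning unsafe (as in Figure 9)."""
    return (step_hiddens @ align_mat) @ u
```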

Line plot of universal safety scores over generation steps for safe vs unsafe prompts.
Figure 9: Safety score trajectories during generation for LLaMA3-8B-Instruct.

Preliminary exploration of intervention

Early tests injected the universal vector during generation as a possible intervention; results were inconclusive but motivate future controlled experiments.
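The post does not specify the injection scheme; one natural candidate (an assumption, not the authors' method) is to project the unsafe component out of the hidden state at each step, scaled by a strength alpha:

```python
import numpy as np

def suppress_unsafe(hidden, v_unsafe, alpha=1.0):
    """Remove alpha times the component of `hidden` along the unsafe
    direction. alpha = 1.0 zeroes the projection; alpha < 1 attenuates
    it, and a negative alpha would amplify it instead."""
    v = v_unsafe / np.linalg.norm(v_unsafe)
    return hidden - alpha * (hidden @ v) * v
```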

Limitations