Aditya Kasliwal
Pratinav Seth
Vinay Kumar Sankarapu

C-ΔΘ: Circuit-Restricted Weight Arithmetic for Selective Refusal

April 2, 2026
Mumbai Lab

1  Introduction

Large language models (LLMs) are increasingly deployed in settings that require selective behavioral control: systems should refuse disallowed content (e.g., crime facilitation, unqualified legal or medical advice, explicit sexual content) while responding normally to benign requests. At deployment scale, the enforcement mechanism is also a systems constraint: it should be reliable, auditable, and cheap to serve across large volumes of generations.

A common control primitive is activation steering, which modifies internal states during the forward pass to induce or suppress a behavior. While easy to prototype, steering introduces an inference-time intervention pathway (runtime hooks and control logic), so the cost and complexity recur on every generation. Moreover, global activation edits can create broad interference and unintended refusals, and steering can be ineffective when the target behavior is weak or absent.

These limitations motivate moving from runtime control to checkpoint-level control: a one-time model update that can be deployed anywhere a standard checkpoint can be served. Recent work suggests that refusal can be governed by compact internal mechanisms, which helps explain both the effectiveness and brittleness of global interventions Arditi et al. (2024). Conditional steering improves selectivity by gating when interventions are applied, but it still retains an inference-time control path and associated deployment costs Lee et al. (2025).

This motivates our central question: Can mechanistic understanding of refusal behavior be distilled into a deployment-ready checkpoint update that requires no inference-time hooks?

We seek interventions that are simultaneously (i) behaviorally selective, (ii) mechanistically localized, and (iii) deployment-friendly (produce a drop-in checkpoint with no inference-time hooks). We propose circuit-guided weight editing: a two-stage methodology that first localizes refusal behavior to a sparse circuit and then performs surgical parameter updates restricted to that circuit. This shifts safety control from a recurring per-request intervention cost to a one-time offline edit, while limiting collateral changes outside the localized mechanism.

We make three contributions:

  1. Circuit-to-checkpoint safety control: the first integration of faithfulness-optimized circuit discovery with constrained weight editing, producing deployment-ready checkpoints requiring no inference-time hooks;
  2. Mechanistically-grounded parameter selection: a circuit-restricted editing protocol that updates <5% of weights while maintaining low over-refusal rates and minimal utility degradation (benchmarked on MMLU Hendrycks et al. (2021) and GSM8K Cobbe et al. (2021));
  3. Robust generalization with validated circuit localization: demonstrating consistent performance across 6 models and 5 harm categories, with out-of-distribution generalization validated on SORRY-Bench Xie et al. (2024).

2  Related Work

Figure 1: Targeted Behavioral Steering via Circuit-Restricted Weight Editing. Comparison of model responses to a "Legal Opinion" safety prompt. The Base Model (left) complies with the unsafe request, while the Steered Model (right), optimized using C-ΔΘ, successfully refuses. This demonstrates effective harmful behavior removal through weight updates alone, without inference-time interventions.

Prior work controls LLM outputs by steering intermediate activations in specific directions. These vectors can be computed from contrastive input pairs Arditi et al. (2024); Turner et al. (2023), optimized via gradient descent Subramani et al. (2022), derived from SAEs Wang et al. (2025b), or extracted via representation engineering Zou et al. (2023). The vectors are scaled and added to hidden states between layers Rimsky et al. (2024) or at specific attention heads Li et al. (2023) during generation. Activation steering has modulated style, sentiment, truthfulness, sycophancy, and refusal. Conditional Activation Steering (CAST) improves selectivity by learning when to apply interventions Lee et al. (2025). However, all these methods require inference-time intervention, tying high operational costs to volume.

Weight Vector Arithmetic: Ilharco et al. (2023) introduced task vectors, directions in weight space obtained by subtracting pre-trained model weights from fine-tuned model weights. Task vectors were shown to compose capabilities (by addition), reduce toxic language generation (by subtraction), and define new tasks through analogies. Subsequent work extended this line by developing methods to merge task vectors while mitigating interference Yadav et al. (2023); Wang et al. (2024); Davari and Belilovsky (2024); Wang et al. (2025a). More recently, Fierro and Roger (2025) proposed contrastive weight steering, which isolates a behavior direction in parameter space from opposed fine-tunes and adds or removes it to steer the deployed checkpoint. While weight-space approaches are deployment-friendly, a recurring challenge is where to edit: many methods rely on heuristics or assumptions to choose intervention sites.

Circuit Discovery and Mechanistic Localization: Mechanistic interpretability localizes behaviors to sparse circuits: subsets of computation causally responsible for specific behaviors Hanna et al. (2024). Since exhaustive causal testing is expensive, scalable approximations like edge attribution patching (EAP) have been proposed. Hanna et al. (2024) show overlap-based metrics can mislead and propose EAP-IG to improve faithfulness: the requirement that removing computation outside the circuit doesn’t change behavior. This motivates using circuit discovery to determine where to intervene, not just explain behavior. We use EAP-IG over alternatives (e.g., activation patching, ACDC) because it optimizes for faithfulness, directly aligning with our goal of restricting weight updates to causally necessary parameters.

Positioning: Our work synthesizes three threads: (i) compact refusal control signals, (ii) faithfulness-oriented circuit localization, and (iii) deployment-friendly parameter editing. The key idea is to shift safety control offline: we localize a refusal-causal circuit and then apply a constrained weight update restricted to that circuit, yielding a standard edited checkpoint. Compared to inference-time steering (including conditional variants), this removes the need for runtime intervention hooks and avoids paying an intervention cost on every generation. The key differentiator from weight steering is mechanistically-grounded site selection: rather than editing parameters heuristically or uniformly, circuit discovery identifies the specific subset of computation causally responsible for the target behavior, reducing collateral interference and improving the safety-utility tradeoff.

3  Method

3.1  Setup and notation

Let \(f_{\theta}\) be a transformer language model with \(L\) layers and parameters \(\theta\). Given a prompt \(x\) and prefix \(y_{\lt t}\), the model defines next-token probabilities \(p_{\theta}(y_t \mid x, y_{\lt t})\). We denote the residual-stream hidden state at layer \(\ell \in \{1, \ldots, L\}\) and token position \(t\) as \(h_{\ell,t} \in \mathbb{R}^d\).

Contrastive supervision: We assume access to contrastive prompt pairs \((x^{\text{harm}}, x^{\text{benign}})\) that share topic and style but differ in the desired policy outcome: the model should refuse \(x^{\text{harm}}\) and comply with \(x^{\text{benign}}\). These contrastive pairs are required for the circuit discovery phase, while harmful prompts are needed for the editing phase.

Goal : Our goal is to produce an edited checkpoint \(\theta'\) that exhibits selective refusal while avoiding inference-time intervention mechanisms. Concretely, we shift control from a recurring per-generation intervention to a one-time offline update restricted to a localized subset of parameters.

Components: We define a component \(u\) as a named activation site in the forward pass (at a specified layer and token position) whose value can be recorded and whose influence on a behavioral objective can be differentiated. Let \(a_u(x) \in \mathbb{R}^{d_u}\) denote the component activation produced when running prompt \(x\) under a fixed scoring protocol.
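To make the notion of a recordable component concrete, here is a minimal pure-Python sketch: a toy two-"layer" model with named call sites standing in for the mlp2 projections. In practice this would use PyTorch forward hooks on the real model; the component names and toy computation here are illustrative assumptions, not the paper's code.

```python
# Sketch: recording named component activations a_u(x) during a forward pass.
class ActivationRecorder:
    def __init__(self):
        self.cache = {}          # component name -> recorded activation

    def record(self, name, value):
        self.cache[name] = value
        return value             # pass-through, like a forward hook

def toy_forward(x, rec):
    # Each named call site is a "component" u; its recorded value is a_u(x).
    h = rec.record("layer0.mlp2", [v * 2.0 for v in x])
    h = rec.record("layer1.mlp2", [v + 1.0 for v in h])
    return h

rec = ActivationRecorder()
out = toy_forward([1.0, 2.0], rec)
print(rec.cache["layer0.mlp2"])  # [2.0, 4.0]
```

Because each site is named, the same cache keys can later index attribution scores and, ultimately, the circuit mask.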

3.2  Circuit discovery with EAP-IG

To localize refusal computation, we use Edge Attribution Patching with Integrated Gradients (EAP-IG) Hanna et al. (2024). EAP-IG assigns importance scores to components by integrating gradients along an interpolation path between benign and harmful internal states.

Template construction :

We curate two template sets to define reference behaviors: ℛ containing 100+ refusal prefixes and 𝒞 containing 100+ compliance prefixes. These templates serve as lightweight, controllable targets that avoid requiring an external policy classifier during circuit discovery and editing. (More details and illustrations in Appendix E).

Behavioral objective :
We construct reference token distributions from template-based predictions. For each contrastive pair \(\big(x_i^{\text{harm}}, x_i^{\text{benign}}\big)\) in our training set, we sample templates \(r_i \sim \mathcal{R}\) and \(c_i \sim \mathcal{C}\), then extract reference distributions by running the base model \(\theta_0\):

\[ p_{\text{refuse}}(\cdot) \;=\; \frac{1}{N}\sum_{i=1}^{N} p_{\theta_0}\!\big(\cdot \mid x_i^{\text{harm}} \oplus r_i, t^{\star}\big), \tag{1} \] \[ p_{\text{comply}}(\cdot) \;=\; \frac{1}{N}\sum_{i=1}^{N} p_{\theta_0}\!\big(\cdot \mid x_i^{\text{benign}} \oplus c_i, t^{\star}\big), \tag{2} \]

where \(\oplus\) denotes concatenation, \(t^{\star}\) is the first generation position, and distributions are over vocabulary \(\mathcal{V}\). For a prompt \(x\) and model state \(\theta\), we measure refusal tendency via:

\[ J(x;\theta) \;=\; \mathrm{KL}\!\left(p_{\text{refuse}} \,\|\, p_{\theta}(\cdot \mid x, t^{\star})\right) \;-\; \mathrm{KL}\!\left(p_{\text{comply}} \,\|\, p_{\theta}(\cdot \mid x, t^{\star})\right), \tag{3} \]

where \(\mathrm{KL}(p \,\|\, q) = \sum_{v \in \mathcal{V}} p(v)\log\!\left(\frac{p(v)}{q(v)}\right)\). Under this definition, \(J\) decreases as the model’s output distribution at \(t^{\star}\) moves toward refusal-like continuations and away from compliance-like ones, enabling gradient-based attribution for circuit discovery without requiring full generation.
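Equation (3) can be sketched numerically. The 3-token distributions below are toy values, not real model outputs; note that, as the formula is written, the score comes out lower when the model's distribution sits closer to the refusal references.

```python
import math

def kl(p, q):
    # KL(p || q) over a shared vocabulary, skipping zero-probability terms
    return sum(pv * math.log(pv / qv) for pv, qv in zip(p, q) if pv > 0)

def refusal_score(p_refuse, p_comply, p_model):
    # Eq. (3): J = KL(p_refuse || p_model) - KL(p_comply || p_model)
    return kl(p_refuse, p_model) - kl(p_comply, p_model)

p_refuse = [0.8, 0.1, 0.1]   # toy reference mass on refusal-like tokens
p_comply = [0.1, 0.1, 0.8]   # toy reference mass on compliance-like tokens

J_refusing  = refusal_score(p_refuse, p_comply, [0.7, 0.2, 0.1])
J_complying = refusal_score(p_refuse, p_comply, [0.1, 0.2, 0.7])
print(J_refusing < J_complying)  # True: refusal-like outputs yield lower J
```

Only the gradient of \(J\) (not its sign convention) matters for EAP-IG attribution, which uses \(|\mathrm{score}_i(u)|\).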

Aggregation and mask construction:
We compute per-pair attributions \(\mathrm{score}_i(u)\) (the EAP-IG importance of component \(u\) on contrastive pair \(i\)) across many contrastive pairs and aggregate by mean absolute attribution:

\[ S(u) \;=\; \frac{1}{N}\sum_{i=1}^{N}\left|\mathrm{score}_i(u)\right|. \tag{4} \]

We then construct a per-layer circuit mask by selecting, within each layer \(\ell\), the top-\(\kappa\) fraction of components according to \(S(u)\). This yields a binary mask \(C=\{C_{\ell}\}_{\ell=1}^{L}\) that fixes the editing target set independent of prompt length. We store \(C\) and the scores \(S(u)\) as a JSON artifact for reuse across offline editing runs.
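The per-layer top-\(\kappa\) selection and JSON artifact can be sketched as follows; the layer indices, component scores, and artifact schema are illustrative assumptions, not the paper's exact format.

```python
import json

# Sketch of per-layer circuit mask construction: within each layer, keep the
# top-kappa fraction of components ranked by aggregate score S(u).
def build_circuit_mask(scores, kappa):
    """scores: {layer: {component_index: S(u)}} -> {layer: kept indices}."""
    mask = {}
    for layer, comp_scores in scores.items():
        k = max(1, int(kappa * len(comp_scores)))
        top = sorted(comp_scores, key=comp_scores.get, reverse=True)[:k]
        mask[layer] = sorted(top)
    return mask

S = {0: {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.2},
     1: {0: 0.05, 1: 0.8, 2: 0.3, 3: 0.6}}
C = build_circuit_mask(S, kappa=0.5)
print(C)  # {0: [0, 2], 1: [1, 3]}

# Persist mask and scores for reuse across offline editing runs
artifact = json.dumps({"kappa": 0.5,
                       "mask": {str(l): idx for l, idx in C.items()}})
```

Fixing the mask per layer (rather than per prompt) is what makes the editing target set independent of prompt length.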

Granularity choice :

EAP-IG can be applied at multiple granularities. We choose component-level masking at the mlp2 projection in each feed-forward network block (FFN), which admits a deterministic mapping to parameter subsets for editing. This makes “what gets updated” explicit and enables stable constrained optimization via gradient masking. We focus on mlp2 as it directly projects intermediate activations back to the residual stream, making it a natural intervention point for behavioral control.

3.3  Circuit-guided weight editing

Figure 2: C-ΔΘ: Circuit-Restricted Weight Arithmetic. (1) Construct contrastive prompt pairs with matched topic/style but different desired policy outcomes (refuse vs. comply). (2) Localize refusal-causal computation using EAP-IG and extract a sparse circuit mask. (3) Perform an offline, circuit-restricted weight update to produce a drop-in edited checkpoint that requires no inference-time hooks.

Intuition : Why does circuit restriction improve selectivity? Refusal behavior emerges from specific computational pathways in the transformer; editing only these pathways concentrates the update on causally relevant parameters while leaving unrelated computation (e.g., factual knowledge, reasoning) untouched. This contrasts with global weight steering, which distributes changes across the entire parameter space and risks collateral interference with capabilities unrelated to the target behavior.

Given a circuit mask \(C\), we perform an offline parameter update restricted to circuit-associated parameters. Let \(\theta=\{\theta^{(1)},\ldots,\theta^{(M)}\}\) denote model parameters grouped into tensors. We convert \(C\) into a binary parameter mask \(\Pi\) with the same shapes as \(\theta\). The construction is layer-local and structured: for each layer \(\ell\), the mask \(C_{\ell}\) selects a subset of component indices (channels) \(\mathcal{J}_{\ell}\). We then identify parameter slices whose forward contribution is confined to those selected indices. Concretely, for each tensor \(\theta^{(j)}\) associated with layer \(\ell\), we define an index set \(\mathcal{J}^{(j)}(\mathcal{J}_{\ell})\) and set \(\Pi^{(j)}[\mathcal{J}^{(j)}(\mathcal{J}_{\ell})]=1\) and \(\Pi^{(j)}[\neg \mathcal{J}^{(j)}(\mathcal{J}_{\ell})]=0\). In standard transformer parameterizations, \(\mathcal{J}^{(j)}(\mathcal{J}_{\ell})\) corresponds to contiguous row/column slices in the layer’s projection tensors, so \(\Pi\) can be implemented efficiently as a structured mask rather than an unstructured sparse pattern.
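A minimal sketch of converting selected component indices \(\mathcal{J}_{\ell}\) into a structured binary parameter mask: whether a component maps to row or column slices depends on the weight layout, and the shapes and indices below are toy values, not taken from any real architecture.

```python
# Sketch: structured binary mask Pi for one layer's projection tensor.
def make_param_mask(shape, idx, axis):
    """Mask of the given shape with 1.0 on selected rows (axis=0) or
    columns (axis=1) and 0.0 elsewhere."""
    rows, cols = shape
    return [[1.0 if (r in idx if axis == 0 else c in idx) else 0.0
             for c in range(cols)]
            for r in range(rows)]

# Selecting components {1, 3} as column slices of a toy 3x4 projection weight
Pi = make_param_mask((3, 4), idx={1, 3}, axis=1)
print(Pi[0])  # [0.0, 1.0, 0.0, 1.0]
```

Because the selected indices form whole row/column slices, the mask stays structured (and cheap to apply) rather than an unstructured sparse pattern.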

Training objective :
We train two auxiliary models with circuit-restricted updates to isolate the weight-space direction associated with refusal behavior.

Using the harmful prompt dataset, we fine-tune:

  1. Positive model \(\theta^{+}\): harmful prompts paired with refusal templates;
  2. Negative model \(\theta^{-}\): harmful prompts paired with compliance templates.

Both models optimize cross-entropy loss on the template tokens, with the instruction portion masked out. We enforce the circuit constraint via gradient masking during backpropagation:

\[ \nabla_{\theta}\mathcal{L} \leftarrow \Pi \odot \nabla_{\theta}\mathcal{L}, \tag{5} \]

ensuring only circuit-associated parameters are updated. Formally:

\[ \theta^{+} \;=\; \arg\min_{\theta}\; \mathbb{E}_{x^{\text{harm}},\, r \sim \mathcal{R}} \Big[\mathcal{L}_{\mathrm{CE}}(r \mid x^{\text{harm}}; \theta)\Big], \tag{6} \] \[ \theta^{-} \;=\; \arg\min_{\theta}\; \mathbb{E}_{x^{\text{harm}},\, c \sim \mathcal{C}} \Big[\mathcal{L}_{\mathrm{CE}}(c \mid x^{\text{harm}}; \theta)\Big]. \]

where \(\mathcal{L}_{\mathrm{CE}}\) is the cross-entropy loss computed only over template tokens, and the non-circuit parameters \(\theta_{\neg C}\) remain frozen at \(\theta_0\) for both models. We then extract the circuit-localized refusal direction:

\[ \Delta\theta_{\text{circuit}} \;=\; \theta^{+} - \theta^{-}, \tag{7} \]

and apply it to the base model weights:

\[ \theta' \;=\; \theta_0 + \alpha \cdot \Delta\theta_{\text{circuit}}, \tag{8} \]

where \(\alpha\) is a steering strength hyperparameter. Since only circuit parameters were updated during fine-tuning (typically \(< 5\%\) of the model), \(\Delta\theta_{\text{circuit}}\) is naturally sparse and concentrates the refusal signal in causally relevant pathways.
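Eqs. (7)-(8) amount to simple elementwise weight arithmetic. The sketch below uses toy flat parameter lists in place of checkpoint tensors; in practice the same operation runs per tensor over the full checkpoint.

```python
# Sketch of Eqs. (7)-(8): extract the circuit-localized direction and apply
# it to the base weights with steering strength alpha.
def weight_arithmetic(theta0, theta_pos, theta_neg, alpha):
    delta = [p - n for p, n in zip(theta_pos, theta_neg)]   # Eq. (7)
    return [t0 + alpha * d for t0, d in zip(theta0, delta)]  # Eq. (8)

theta0  = [1.0, 2.0, 3.0]
theta_p = [1.5, 2.0, 3.0]   # only the first (circuit) entry moved in training
theta_n = [0.5, 2.0, 3.0]
theta_ed = weight_arithmetic(theta0, theta_p, theta_n, alpha=0.5)
print(theta_ed)  # [1.5, 2.0, 3.0] -- delta is zero outside the circuit
```

Because \(\theta^{+}\) and \(\theta^{-}\) differ from \(\theta_0\) only inside the circuit, the difference \(\Delta\theta_{\text{circuit}}\) is sparse by construction, as the text notes.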

Offline editing protocol:
We use the publicly available contrastive prompt dataset from Lee et al. (2025), containing harmful and benign instruction pairs across five harm categories (Crime, Hate, Health, Legal, Sexual); more details are in Appendix D. Full training hyperparameters are provided in Appendix C. The output is an edited checkpoint \(\theta'\) that can be served with an unmodified forward pass, i.e., without any inference-time intervention hooks, gating, or auxiliary control models.

3.4  Algorithm: C-ΔΘ

Algorithm 1 \(C\text{-}\Delta\Theta\): Circuit-Restricted Weight Arithmetic

Input: Contrastive pairs \(\{(x_i^{\text{harm}}, x_i^{\text{benign}})\}_{i=1}^{N}\), Initial model checkpoint \(\theta_0\), IG steps \(m\), per-layer fraction \(\kappa\), epochs \(E\), steering strength \(\alpha\), template sets \(\mathcal{R}, \mathcal{C}\)

Output: Weight steered model checkpoint \(\theta'\) (deployable without inference hooks)

Stage 1: Circuit Discovery

Compute EAP-IG attributions along benign \(\rightarrow\) harmful interpolation Hanna et al. (2024); Sundararajan et al. (2017)

Select top-\(\kappa\) fraction of components per layer to form circuit \(C\)

Convert \(C\) into parameter mask \(\Pi\) (1 for circuit parameters, 0 otherwise)

Stage 2: Circuit-Restricted Weight Steering

Initialize \(\theta^{+} \leftarrow \theta_0\) and \(\theta^{-} \leftarrow \theta_0\)

for epoch \(=\) 1 to \(E\) do

for each \(x_i^{\text{harm}}\) do

Sample \(r_i \sim \mathcal{R}\) and \(c_i \sim \mathcal{C}\)

Update positive model:

Compute \(\mathcal{L}^{+} \leftarrow \mathcal{L}_{\mathrm{CE}}(r_i \mid x_i^{\text{harm}}; \theta^{+})\)

Apply masked gradient: \(\theta^{+} \leftarrow \theta^{+} - \eta \cdot \big(\Pi \odot \nabla_{\theta^{+}}\mathcal{L}^{+}\big)\)

Update negative model:

Compute \(\mathcal{L}^{-} \leftarrow \mathcal{L}_{\mathrm{CE}}(c_i \mid x_i^{\text{harm}}; \theta^{-})\)

Apply masked gradient: \(\theta^{-} \leftarrow \theta^{-} - \eta \cdot \big(\Pi \odot \nabla_{\theta^{-}}\mathcal{L}^{-}\big)\)

end for

end for

Extract circuit-localized direction: \(\Delta\theta_{\text{circuit}} \leftarrow \theta^{+} - \theta^{-}\)

Apply to base model: \(\theta' \leftarrow \theta_0 + \alpha \cdot \Delta\theta_{\text{circuit}}\)
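The masked update at the heart of Stage 2 (Eq. 5) reduces to an SGD step whose gradient is multiplied elementwise by \(\Pi\). A toy sketch with synthetic gradients (in practice the loss and gradients come from the masked cross-entropy training in Algorithm 1):

```python
# Sketch of Eq. (5): multiply the gradient elementwise by the parameter mask
# Pi before stepping, so parameters outside the circuit never move.
def masked_sgd_step(theta, grad, Pi, lr):
    return [t - lr * (m * g) for t, g, m in zip(theta, grad, Pi)]

theta = [1.0, 1.0, 1.0]
grad  = [0.5, 0.5, 0.5]      # synthetic gradient, not a real loss gradient
Pi    = [1.0, 0.0, 1.0]      # middle parameter lies outside the circuit
theta = masked_sgd_step(theta, grad, Pi, lr=0.5)
print(theta)  # [0.75, 1.0, 0.75]
```

Running this step for both \(\theta^{+}\) (refusal targets) and \(\theta^{-}\) (compliance targets) guarantees their difference is supported only on circuit parameters.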

4  Experimental Setup

Table 1: Refusal rates (%) across steering methods and harm categories. Methods: Base (unmodified model), AS (Activation Steering), CAST (Conditional Activation Steering), WS (Weight Steering), and OURS (our proposed method). ✓ denotes harmless prompt refusal rate (lower (↓) is better); × denotes harmful prompt refusal rate (higher (↑) is better).

| Category | Model | Base ✓(↓) / ×(↑) | AS ✓(↓) / ×(↑) | CAST ✓(↓) / ×(↑) | WS ✓(↓) / ×(↑) | OURS ✓(↓) / ×(↑) |
|---|---|---|---|---|---|---|
| Crime | Llama-3.1-8B-Instruct | 1.2 / 42.2 | 25.0 / 84.6 | 1.8 / 81.8 | 18.0 / 74.2 | 1.4 / 75.0 |
| Crime | Llama-3.2-1B-Instruct | 1.4 / 44.4 | 48.6 / 65.4 | 48.6 / 65.4 | 12.8 / 69.0 | 1.6 / 78.2 |
| Crime | Llama-3.2-3B-Instruct | 0.6 / 25.4 | 47.2 / 75.2 | 0.6 / 41.2 | 37.4 / 80.8 | 1.8 / 80.4 |
| Crime | Gemma-2-9B-IT | 0.8 / 31.6 | 49.6 / 87.2 | 19.0 / 79.4 | 11.6 / 90.8 | 7.2 / 86.0 |
| Crime | Gemma-3-12B-IT | 0.8 / 32.8 | 22.2 / 86.0 | 3.8 / 77.0 | 18.0 / 95.8 | 9.0 / 93.4 |
| Crime | Gemma-3-4B-IT | 0.4 / 34.8 | 68.0 / 90.2 | 19.4 / 78.4 | 7.8 / 88.8 | 1.2 / 88.2 |
| Hate | Llama-3.1-8B-Instruct | 1.4 / 61.8 | 25.0 / 82.8 | 2.6 / 79.2 | 8.2 / 66.6 | 1.4 / 68.6 |
| Hate | Llama-3.2-1B-Instruct | 1.4 / 57.6 | 48.6 / 71.2 | 1.0 / 70.8 | 20.8 / 78.6 | 1.6 / 80.2 |
| Hate | Llama-3.2-3B-Instruct | 0.6 / 44.2 | 47.2 / 68.4 | 0.4 / 54.4 | 17.0 / 80.6 | 1.0 / 79.2 |
| Hate | Gemma-2-9B-IT | 0.8 / 52.0 | 49.6 / 81.0 | 20.4 / 90.2 | 10.0 / 95.8 | 4.4 / 92.4 |
| Hate | Gemma-3-12B-IT | 0.8 / 60.4 | 22.2 / 92.4 | 18.8 / 92.0 | 9.8 / 81.8 | 1.4 / 89.0 |
| Hate | Gemma-3-4B-IT | 0.4 / 64.6 | 68.0 / 89.4 | 15.0 / 89.4 | 2.4 / 93.6 | 1.0 / 86.4 |
| Health | Llama-3.1-8B-Instruct | 1.2 / 12.2 | 25.0 / 51.8 | 1.8 / 45.2 | 18.0 / 45.8 | 5.0 / 35.8 |
| Health | Llama-3.2-1B-Instruct | 1.4 / 15.4 | 48.6 / 74.4 | 15.8 / 77.0 | 10.2 / 39.6 | 4.6 / 38.4 |
| Health | Llama-3.2-3B-Instruct | 0.6 / 11.0 | 47.2 / 80.2 | 0.2 / 28.8 | 28.8 / 64.0 | 3.4 / 34.4 |
| Health | Gemma-2-9B-IT | 0.8 / 25.8 | 49.6 / 81.6 | 15.6 / 74.8 | 22.2 / 88.6 | 10.6 / 76.0 |
| Health | Gemma-3-12B-IT | 0.8 / 3.2 | 22.2 / 56.2 | 2.2 / 51.8 | 22.8 / 62.6 | 9.8 / 64.8 |
| Health | Gemma-3-4B-IT | 0.4 / 4.6 | 68.0 / 84.8 | 5.4 / 42.0 | 8.8 / 47.2 | 1.0 / 47.2 |
| Legal | Llama-3.1-8B-Instruct | 1.2 / 4.4 | 25.0 / 48.4 | 2.6 / 38.2 | 4.8 / 29.0 | 3.4 / 28.2 |
| Legal | Llama-3.2-1B-Instruct | 1.4 / 5.8 | 48.6 / 39.4 | 14.2 / 28.4 | 9.8 / 29.6 | 5.0 / 29.2 |
| Legal | Llama-3.2-3B-Instruct | 0.6 / 2.6 | 47.2 / 70.0 | 0.4 / 13.6 | 9.4 / 27.8 | 4.8 / 27.8 |
| Legal | Gemma-2-9B-IT | 0.8 / 4.8 | 49.6 / 70.8 | 36.0 / 75.0 | 15.8 / 61.8 | 10.0 / 52.6 |
| Legal | Gemma-3-12B-IT | 0.8 / 1.0 | 22.2 / 54.6 | 2.2 / 47.4 | 9.6 / 42.8 | 5.6 / 36.4 |
| Legal | Gemma-3-4B-IT | 0.4 / 1.0 | 68.0 / 75.0 | 15.6 / 44.8 | 3.8 / 23.4 | 2.8 / 24.4 |
| Sexual | Llama-3.1-8B-Instruct | 1.2 / 8.4 | 25.0 / 52.2 | 2.0 / 49.2 | 5.8 / 41.6 | 2.6 / 52.0 |
| Sexual | Llama-3.2-1B-Instruct | 1.4 / 11.6 | 48.6 / 71.4 | 3.2 / 60.2 | 18.8 / 50.6 | 4.0 / 48.4 |
| Sexual | Llama-3.2-3B-Instruct | 0.6 / 1.6 | 47.2 / 59.0 | 0.6 / 9.8 | 10.0 / 40.4 | 6.2 / 52.0 |
| Sexual | Gemma-2-9B-IT | 0.8 / 8.8 | 49.6 / 76.0 | 21.8 / 75.2 | 10.8 / 65.4 | 7.6 / 75.6 |
| Sexual | Gemma-3-12B-IT | 0.8 / 13.6 | 22.2 / 75.4 | 2.6 / 68.6 | 4.0 / 58.2 | 1.8 / 55.6 |
| Sexual | Gemma-3-4B-IT | 0.4 / 8.0 | 68.0 / 85.4 | 16.6 / 59.6 | 32.8 / 92.0 | 4.6 / 93.8 |

Models: We evaluate on open-weight, instruction-tuned LLMs spanning multiple training lineages and model families (e.g., Llama Grattafiori et al. (2024) and Gemma Team et al. (2025)).

Data requirements and construction: Circuit discovery requires contrastive pairs \((x^{\text{harm}}, x^{\text{benign}})\) to compute EAP-IG attributions, while weight editing requires only harmful prompts \(x^{\text{harm}}\) paired with templates during training. We evaluate across five harm categories: crime, hate, health, legal, and sexual content. For each category \(c\), we construct pairs by setting \(x^{\text{harm}}\) as the \(c\)-conditioned variant and \(x^{\text{benign}}\) as the matched base instruction, yielding topic-aligned pairs that isolate safety-relevant signals.

Baselines :

We compare against representative activation-time and weight-space baselines:

  1. Activation Steering (AS): Standard activation steering that applies refusal vectors via inference-time hooks indiscriminately to all inputs Turner et al. (2023).
  2. Conditional Activation Steering (CAST): Selective activation steering that uses condition vectors to determine when to apply refusal steering based on input context Lee et al. (2025).
  3. Weight Steering (WS): Contrastive weight arithmetic that adds refusal directions directly to model parameters Fierro and Roger (2025).

All methods use identical evaluation settings and prompts. Baseline hyperparameters are chosen heuristically to achieve the best results possible; details are provided in Appendix A. For our proposed method, all hyperparameters, including those used for circuit discovery, are listed in Appendix C.

Refusal classification and judge protocol:

We estimate refusal and compliance using two complementary classifiers. First, we apply a refusal detector based on RoBERTa (see Appendix B). Second, we use an LLM judge (Llama 3.1 8B Instruct) prompted with a rubric to assign refuse versus answer. We mark an output as refused if either classifier predicts refusal. Full prompts and other details are provided in Appendix B.
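The OR-combination of the two classifiers can be sketched as follows; the boolean classifier outputs below are stand-ins for real detector and judge calls.

```python
# Sketch of the judge protocol: an output counts as refused if EITHER the
# RoBERTa-style detector OR the LLM judge labels it a refusal.
def is_refused(detector_says_refuse, judge_says_refuse):
    return detector_says_refuse or judge_says_refuse

# (detector, judge) verdicts for three hypothetical model outputs
outputs = [(True, False), (False, True), (False, False)]
refusal_rate = 100.0 * sum(is_refused(d, j) for d, j in outputs) / len(outputs)
print(refusal_rate)
```

The OR rule makes the refusal estimate conservative: an output escapes the "refused" label only when both classifiers agree it answered.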

Ablation:

  1. Utility retention: We evaluate on MMLU (5-shot, accuracy) and GSM8K (4-shot, flexible-extract match) to assess capability preservation. See Appendix F.2 for details.
  2. Circuit validation: We perform an inverse circuit ablation by editing the bottom-κ components to verify that the discovered circuit captures causally relevant computation.
  3. OOD generalization: We evaluate category-steered models on SORRY-Bench subsets to test robustness beyond the training distribution.
  4. Circuit composition (exploratory): We test one multi-category combination (Sexual + Health circuits) via neuron-wise aggregation.

5  Results

We evaluate C-ΔΘ across 30 experimental settings (6 models × 5 harm categories), measuring: (i) harmful prompt refusal rate on category-specific test sets (higher is better), (ii) over-refusal rate on a general benign set (lower is better), and (iii) utility preservation on standard benchmarks. Table 1 reports the primary safety metrics comparing our method against three baselines. Tables 2–5 provide ablation studies on utility retention, circuit validation, multi-category composition, and out-of-distribution generalization.

Overall effectiveness : Across all 30 settings (Table 1), our method achieves substantial increases in harmful refusal while maintaining low over-refusal. Harmful refusal rates range from 24.4% to 93.8% (vs. base: 1.0–64.6%), while over-refusal remains controlled at 1.0–10.6%, marginally above the base model’s 0.4–1.4%.

Comparison with Activation Steering : Activation Steering achieves high harmful refusal (65.4–92.4%) but at severe cost to selectivity, with over-refusal rates of 22.2–68.0%. Gemma-3-4B-IT exhibits extreme degradation, with 68.0% benign refusal in multiple categories. Our method matches or approaches AS harmful refusal (e.g., 88.2% vs. 90.2% for Crime on Gemma-3-4B-IT) while reducing over-refusal by 66.8 percentage points (1.2% vs. 68.0%), demonstrating that circuit restriction enables targeted refusal without indiscriminate blocking.

Comparison with CAST : Conditional Activation Steering improves upon AS in some settings but exhibits high variance and catastrophic failures. While CAST achieves strong performance on certain model-category pairs (e.g., 90.2% Hate refusal on Gemma-2-9B-IT with 20.4% over-refusal), it fails dramatically on others: Llama-3.2-3B-Instruct achieves only 9.8% Sexual refusal and 13.6% Legal refusal, barely above baseline. CAST also exhibits severe over-refusal in multiple settings (e.g., 48.6% on Llama-3.2-1B Crime). These failures occur when the underlying refusal representation is weak or when learned conditional gates fail to trigger. Our approach avoids this failure mode by directly strengthening refusal through weight edits rather than gating unreliable activation patterns.

Comparison with Weight Steering : Weight Steering demonstrates that weight-space interventions can induce refusal effectively (65.4–95.8% on strong categories) but lacks the precision of circuit-restricted updates. WS exhibits elevated over-refusal (>10%) in 14 of 30 settings. On Llama-3.2-3B-Instruct Crime, WS achieves 80.8% harmful refusal with 37.4% over-refusal, while our method achieves comparable harmful refusal (80.4%) with only 1.8% over-refusal, a 35.6 percentage point improvement. Similarly, on Gemma-3-4B-IT Sexual, our method achieves both higher harmful refusal (93.8% vs. 92.0%) and lower over-refusal (4.6% vs. 32.8%). These results validate that mechanistically-grounded parameter selection substantially improves the safety-utility tradeoff compared to heuristic global edits.

Category-dependent performance: Performance varies systematically across harm categories, reflecting differences in base model representations. Strong categories (Crime, Hate, Sexual) achieve 68.6–93.8% harmful refusal, corresponding to categories where base models already exhibit moderate refusal tendency (25.4–64.6%). Weak categories (Health, Legal) show lower but meaningful gains: Health ranges 34.4–76.0% (vs. base 3.2–25.8%) and Legal ranges 24.4–52.6% (vs. base 1.0–5.8%). The weaker performance on Health and Legal suggests these policy boundaries are less mechanistically distinct in base models, limiting what circuit localization can recover. Notably, larger models (Gemma-3-12B, Gemma-2-9B) maintain stronger performance even on weak categories (e.g., Gemma-2-9B achieves 76.0% on Health vs. Llama-3.2-3B’s 34.4%), indicating that circuit capacity scales with model size.

Deployment cost: Our method produces a standard checkpoint requiring no inference-time hooks or auxiliary gating logic, enabling deployment with unmodified inference stacks at identical throughput. In contrast, activation-time methods incur recurring per-request overhead through forward hooks (AS) or additional condition evaluation (CAST). Our approach shifts this cost to a one-time offline update (circuit discovery + masked fine-tuning). At production scale, activation-time overhead accumulates to exceed our one-time cost within days, after which our approach incurs no additional inference cost.
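As a back-of-envelope illustration of the break-even claim, where all numbers are hypothetical (the paper does not report per-request hook overhead, traffic volume, or one-time edit cost):

```python
# Hypothetical break-even calculation: when does recurring activation-steering
# overhead exceed a one-time circuit-editing cost? All inputs are illustrative.
def breakeven_days(edit_cost_gpu_hours, overhead_s_per_req, requests_per_day):
    recurring_gpu_hours_per_day = overhead_s_per_req * requests_per_day / 3600.0
    return edit_cost_gpu_hours / recurring_gpu_hours_per_day

days = breakeven_days(edit_cost_gpu_hours=24.0,   # assumed one-time edit cost
                      overhead_s_per_req=0.01,    # assumed per-request hook cost
                      requests_per_day=1_000_000) # assumed traffic
print(round(days, 2))
```

Under these assumed inputs the one-time edit pays for itself in under ten days, consistent with the "within days" framing above; the exact crossover of course depends on real overhead and traffic.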

5.1  Ablation Studies

Table 2: Refusal rates (%) and utility metrics for Gemma-3-4B-IT and Llama-3.2-3B-Instruct comparing base model with our method across harm categories. Base row shows constant model performance; subsequent rows show category-specific results. ✓↓: harmless refusal (lower better); ×↑: harmful refusal (higher better). MMLU: 5-shot accuracy; GSM8K: 4-shot flexible-extract accuracy.

Gemma-3-4B-IT

| Method/Category | ✓ ↓ | ✗ ↑ | MMLU | GSM8K |
|---|---|---|---|---|
| Base | - | - | 59.6 | 76.6 |
| OURS - Crime | 1.2 | 88.2 | 59.1 | 77.3 |
| OURS - Hate | 1.0 | 86.4 | 59.5 | 76.4 |
| OURS - Health | 1.0 | 47.2 | 59.3 | 75.4 |
| OURS - Legal | 2.8 | 24.4 | 59.2 | 75.5 |
| OURS - Sexual | 4.6 | 93.8 | 58.7 | 74.8 |

Llama-3.2-3B-Instruct

| Method/Category | ✓ ↓ | ✗ ↑ | MMLU | GSM8K |
|---|---|---|---|---|
| Base | - | - | 61.7 | 77.4 |
| OURS - Crime | 1.8 | 80.4 | 55.7 | 75.0 |
| OURS - Hate | 1.0 | 79.2 | 59.5 | 75.7 |
| OURS - Health | 3.4 | 34.4 | 60.5 | 74.4 |
| OURS - Legal | 4.8 | 27.8 | 60.1 | 75.3 |
| OURS - Sexual | 6.2 | 52.0 | 60.5 | 76.3 |

Table 3: Circuit validation via inverse ablation. Refusal rates (%) for Gemma-3-4B-IT and Llama-3.2-3B-Instruct comparing base model with our method using inverse (bottom-K components) and actual circuits (top-K components) across harm categories. Base: unmodified model; OURS (Inverse): using inverse circuit; OURS (Actual): using actual circuit. ✓↓: harmless prompts (lower better); ×↑: harmful prompts (higher better).

Gemma-3-4B-IT

| Category | Base ✓(↓) / ✗(↑) | OURS (Inverse) ✓(↓) / ✗(↑) | OURS (Actual) ✓(↓) / ✗(↑) |
|---|---|---|---|
| Crime | 0.4 / 34.8 | 12.8 / 91.0 | 1.2 / 88.2 |
| Hate | 0.4 / 64.6 | 1.6 / 89.2 | 1.0 / 86.4 |
| Health | 0.4 / 4.6 | 36.4 / 82.8 | 1.0 / 47.2 |
| Legal | 0.4 / 1.0 | 37.6 / 82.0 | 2.8 / 24.4 |
| Sexual | 0.4 / 8.0 | 15.4 / 84.6 | 4.6 / 93.8 |

Llama-3.2-3B-Instruct

| Category | Base ✓(↓) / ✗(↑) | OURS (Inverse) ✓(↓) / ✗(↑) | OURS (Actual) ✓(↓) / ✗(↑) |
|---|---|---|---|
| Crime | 0.6 / 25.4 | 23.8 / 91.2 | 1.8 / 80.4 |
| Hate | 0.6 / 44.2 | 1.2 / 59.0 | 1.0 / 79.2 |
| Health | 0.6 / 11.0 | 1.3 / 21.0 | 3.4 / 34.4 |
| Legal | 0.6 / 2.6 | 0.2 / 7.6 | 4.8 / 27.8 |
| Sexual | 0.6 / 1.6 | 1.0 / 12.8 | 6.2 / 52.0 |

Utility retention (Table 2) : Circuit-restricted editing maintains strong capability retention across categories. For Gemma-3-4B-IT, MMLU scores range 58.7–59.5 vs. base 59.6 (max degradation: 0.9 points), and GSM8K ranges 74.8–77.3 vs. base 76.6 (max degradation: 1.8 points). For Llama-3.2-3B-Instruct, MMLU ranges 55.7–60.5 vs. base 61.7 and GSM8K ranges 74.4–76.3 vs. base 77.4 (max degradation: 3.0 points). The Crime category exhibits the largest MMLU drop (6.0 points on Llama-3.2-3B-Instruct), while other categories remain within 2.2 points. Importantly, utility degradation is largely independent of safety effectiveness: Sexual steering achieves 93.8% harmful refusal with 0.9-point MMLU degradation, while Legal steering achieves 24.4% with 0.4-point degradation. These minimal losses, substantially smaller than the degradation typically incurred by unrestricted full fine-tuning, indicate the circuit mask successfully isolates safety-relevant computation from knowledge retrieval and reasoning pathways.

Circuit validation via inverse circuit (Table 3) : To validate that EAP-IG identifies causally relevant computation, we compare editing the actual circuit (top-κ components) against an inverse circuit (bottom-κ components). Table 3 reveals two failure modes for inverse editing. On Gemma-3-4B-IT, the inverse circuit achieves high harmful refusal (82.0–91.0%) but with catastrophic over-refusal (12.8–37.6%), indicating that editing non-causal components breaks the model’s discrimination ability. On Llama-3.2-3B-Instruct, the inverse circuit fails to induce refusal (7.6–21.0%), confirming refusal-causal signals are absent from low-attribution parameters. In contrast, the actual circuit maintains selectivity: on Hate, it achieves 79.2% harmful refusal at 1.0% over-refusal versus 59.0% at 1.2% for the inverse. These results validate that circuit restriction targets the sparse functional core of refusal while preserving benign-harmful discrimination.

Out-of-distribution generalization (Table 4) : We evaluate generalization by testing category-steered models on SORRY-Bench, a held-out benchmark with different prompt distributions. All steered models improve over base: on Gemma-3-4B-IT, base achieves 62.73% while steered models range 66.36–86.36% (Crime: +23.63 points); on Llama-3.2-3B-Instruct, base achieves 67.73% while steered models range 70.45–82.95% (+2.72 to +15.22 points). Category-matched evaluation confirms targeted steering (e.g., Crime-steered achieves 90.56% on SORRY-Bench Crime vs. base 75.56%). Notably, we observe beneficial cross-category transfer: Crime-steered improves Legal refusal from 20.0% to 80.0% on Gemma-3-4B-IT and from 65.0% to 85.0% on Llama-3.2-3B-Instruct, indicating circuit-localized edits capture generalizable safety representations rather than narrow pattern matching.

Circuit composition (Table 5) : We explore multi-category steering by merging Sexual (S) and Health (H) circuits via neuron-wise aggregation, as detailed in Algorithm 2. On Gemma-3-4B-IT, the combined S+H circuit achieves 82.2% Sexual and 39.6% Health refusal, compared to 93.8% and 47.2% for single-category steering (degradation: 11.6 and 7.6 points). On Llama-3.2-3B-Instruct, S+H achieves 32.6% Sexual and 28.6% Health versus 52.0% and 34.4% single-category (degradation: 19.4 and 5.8 points). Despite this interference, over-refusal remains controlled at 1.6–3.0%, comparable to single-category steering. These results demonstrate that circuit-localized directions can be composed for multi-category targeting, though with partial interference when circuits overlap. The preserved selectivity despite reduced effectiveness motivates future work on interference-aware aggregation strategies.
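One plausible neuron-wise aggregation for merging two circuit-localized deltas can be sketched as below. The paper's exact scheme is Algorithm 2 (appendix); the magnitude-based rule here is an assumption for illustration, and the deltas are toy values.

```python
# Exploratory sketch: merge two circuit-localized weight deltas (e.g., the
# Sexual and Health directions) neuron-wise. The rule used here -- keep the
# larger-magnitude entry per parameter -- is an illustrative assumption, not
# necessarily the paper's Algorithm 2.
def merge_deltas(delta_a, delta_b):
    return [a if abs(a) >= abs(b) else b for a, b in zip(delta_a, delta_b)]

d_sexual = [0.4, 0.0, -0.1]
d_health = [0.1, 0.3,  0.0]
merged = merge_deltas(d_sexual, d_health)
print(merged)  # [0.4, 0.3, -0.1]
```

Where the two circuits select disjoint parameters, a rule like this reduces to simple addition of sparse supports; interference arises exactly where the circuits overlap, matching the partial degradation observed in Table 5.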

Table 4: Out-of-distribution generalization on SORRY-Bench. Refusal rates (%) for Gemma-3-4B-IT and Llama-3.2-3B-Instruct: cross-evaluation of category-steered models on selected SORRY-Bench subsets. Base: unmodified model; OURS (X): steered using category X circuit.

Gemma-3-4B-IT

| Eval Set | Base | Crime | Hate | Health | Legal | Sexual |
|---|---|---|---|---|---|---|
| All | 62.73 | 86.36 | 66.36 | 76.82 | 83.64 | 78.18 |
| Crime | 75.56 | 90.56 | 69.44 | 76.11 | 86.11 | 78.89 |
| Hate | 73.75 | 88.75 | 75.00 | 81.25 | 87.50 | 82.50 |
| Health | 0.00 | 80.00 | 50.00 | 80.00 | 80.00 | 90.00 |
| Legal | 20.00 | 80.00 | 70.00 | 90.00 | 90.00 | 95.00 |
| Sexual | 70.00 | 95.00 | 65.00 | 80.00 | 90.00 | 85.00 |

Llama-3.2-3B-Instruct

| Eval Set | Base | Crime | Hate | Health | Legal | Sexual |
|---|---|---|---|---|---|---|
| All | 67.73 | 82.95 | 71.82 | 71.36 | 74.32 | 70.45 |
| Crime | 73.89 | 91.67 | 76.67 | 74.44 | 82.78 | 83.33 |
| Hate | 75.00 | 87.50 | 86.25 | 83.75 | 81.25 | 72.50 |
| Health | 20.00 | 70.00 | 40.00 | 40.00 | 60.00 | 30.00 |
| Legal | 65.00 | 85.00 | 65.00 | 85.00 | 70.00 | 70.00 |
| Sexual | 75.00 | 82.50 | 77.50 | 77.50 | 80.00 | 72.50 |

Table 5: Multi-category circuit composition. Refusal rates (%) for Gemma-3-4B-IT and Llama-3.2-3B-Instruct under different steering configurations. Base: unmodified model; OURS (S): sexual-only circuit; OURS (H): health-only circuit; OURS (S+H): merged sexual+health circuit.

Gemma-3-4B-IT

| Category | Base | OURS (S) | OURS (H) | OURS (S+H) |
|---|---|---|---|---|
| Harmless | 0.4 | 4.6 | 1.0 | 1.6 |
| Health | 4.6 | N/A | 47.2 | 39.6 |
| Sexual | 8.0 | 93.8 | N/A | 82.2 |

Llama-3.2-3B-Instruct

| Category | Base | OURS (S) | OURS (H) | OURS (S+H) |
|---|---|---|---|---|
| Harmless | 0.6 | 6.2 | 3.4 | 3.0 |
| Health | 11.0 | N/A | 34.4 | 28.6 |
| Sexual | 1.6 | 52.0 | N/A | 32.6 |

6  Discussion and Limitations

Advantages of circuit-guided weight editing:

Circuit-guided weight editing fills a deployment gap between brittle prompt-only controls and costly full fine-tuning. By localizing a behavior-relevant circuit and restricting a one-time offline update to that region, we produce a drop-in checkpoint that runs without inference-time hooks. This shifts cost from per-request intervention to a single amortizable edit, while keeping the intervention scope explicit (≤5% of parameters) for targeted audits and regression tests. From a systems perspective, safety controls can then integrate with optimized inference stacks (e.g., vLLM) without modification, whereas activation steering requires custom forward-pass instrumentation that can break such optimizations and complicate deployment.
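The deployment argument above amounts to a one-time offline pass over the checkpoint. A minimal sketch, with NumPy arrays standing in for weight tensors and a hypothetical `edits` map giving the circuit-restricted row updates:

```python
import numpy as np

def apply_circuit_edit(state_dict, edits, alpha=1.0):
    """Bake circuit-restricted updates into a checkpoint offline.

    edits maps parameter name -> (row_indices, delta_rows); only the
    listed rows change, so the edit scope stays explicit and auditable.
    The result is an ordinary checkpoint: no inference-time hooks."""
    new_sd = {name: w.copy() for name, w in state_dict.items()}
    for name, (rows, delta) in edits.items():
        new_sd[name][rows] += alpha * delta
    return new_sd

# Toy checkpoint: edit 2 of 4 rows in one matrix (the paper's edits
# touch at most 5% of parameters; 50% here only for illustration)
sd = {"mlp.down": np.zeros((4, 2))}
edits = {"mlp.down": ([1, 3], np.ones((2, 2)))}
edited = apply_circuit_edit(sd, edits, alpha=0.5)
```

Because `edited` is just a new set of weights, it serves anywhere the base checkpoint serves; the `edits` map itself doubles as an audit artifact recording exactly which parameters were touched.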

Limitations and threats to validity:

Effectiveness depends on the base model: when policy-relevant concepts are weakly represented or entangled, localization can be less selective and edits yield smaller gains. EAP-IG provides behavioral relevance but is not a complete causal account; redundant pathways may remain and results may vary with protocol choices. Localized updates can still produce off-target effects, including benign refusals on borderline prompts, capability shifts outside the chosen benchmarks, and cross-category interactions that are not explicitly measured. Refusal rates rely on an LLM judge with limited human calibration, and utility is tracked with MMLU and GSM8K as coarse indicators. We report results from a single random seed due to computational constraints, though consistent rankings across 30 settings (highest or second highest in 28 of 30 cases) and utility changes below 1% suggest conclusions are not driven by evaluation noise. Failures tend to occur when harmful and benign behaviors are weakly separated in the base model, when smaller models lack distinct functional structure for nuanced safety distinctions, and under distribution shift beyond the contrastive paired setting.

7  Conclusion

We introduced circuit-guided weight editing as a surgical, deployment-friendly alternative to activation steering for controlling safety-relevant behaviors in LLMs. Instead of relying on inference-time intervention hooks, we localize the computation responsible for refusal behavior and apply a one-time, circuit-restricted weight update to produce a drop-in edited checkpoint. This shifts control from recurring runtime intervention to an offline edit, removing serving overhead and making the intervention scope explicit and auditable. Across 6 models and 5 harm categories, circuit-guided edits achieve strong selectivity with minimal utility degradation while updating only ≤5% of parameters. Our results demonstrate that mechanistic localization can be turned into a practical control primitive: a small, permanent weight-space intervention that improves safety behavior without adding inference-time complexity.

8  Ethics and Broader Impact

We study mechanistically guided edits that change refusal behavior via circuit-restricted weight updates. The approach is inherently dual-use: it can help safety calibration by producing a standard, drop-in checkpoint (no hook-based serving dependencies) and by making the intervention scope explicit for audits and targeted regression testing, but the same tooling could be misused to suppress refusals if an attacker has access to model weights. Empirically, gains observed on benchmarks may not hold under adaptive prompting or new jailbreak strategies, and judge-based evaluation can introduce artifacts without calibration and spot-checking. Even localized edits can also cause collateral drift (e.g., changes in capability, tone, or factuality), motivating broad regression tests beyond the target domains. We therefore recommend reporting both robust-refusal and over-refusal, adding stress tests under prompt adaptation when feasible, documenting intervention scope and intended use, and considering restricted release of fine-grained artifacts that could enable safety suppression. Finally, because safety datasets may contain harmful content, we recommend limiting researcher/annotator exposure and following content-handling protocols, and we encourage reporting compute footprint and documenting limitations to reduce downstream misuse and misinterpretation.
