Attempt to steer RWKVv7 and ending up with configurable confidence

Happy Easter! I am _really_ concerned about AI alignment and how getting it wrong would make the world.. well... crash and burn and all humans die (thanks Robert Miles, my beloved AI Safety YouTuber). That kept me up at night and the dread is tearing me apart. I decided to do _something_. But transformers are a pain to deal with because all tokens are processed at once -- that's too much data to understand on my puny machine. They are also the main blocker already tackled by real alignment researchers. I only know how models work from my experience building inference engine backends... I won't be able to make a dent on the problem there. But maybe I can with state space models like RWKV and Mamba. In fact, I believe they are much easier to align (just my hunch). They have a static bottleneck; less place for misalignment to hide. And the fixed state size means I can use classical big data analysis techniques instead of compute-intensive transformer tooling.

Anyway, I am a dude who knows nothing, trying to make some contribution. I have no idea what I am doing.

Oh, and most of the code used in this post is vibe coded. Otherwise I can't finish such a project in a weekend. I am desperate enough to just throw things at the problem and see what sticks.

Experiment setup

Usually this is where I'd introduce what RWKV is and the properties that set it apart from transformers. I don't have time this time. Alignment is such an urgent issue that I consider spending an hour writing about the topic the wrong allocation of resources. Long story short, RWKV is an LLM but not in the transformer family. Instead it is more analogous to an RNN (yes, the classical kind used for time series), but reformulated to make parallel training possible. Like an RNN, there's a set of hidden states. In theory we can look at the hidden state and understand, on a very high level, what the model is thinking internally.

To make things simple, I want to see if I can jailbreak RWKV by messing with the hidden states. Ideally, the hidden states should point in a certain direction that tells later layers "this query from the user is not safe, I should refuse it". And we can figure out that direction by giving the model many different prompts, saving all the hidden states, and doing some matrix math. This is a very big IF. But if true, I should be able to show that RWKV and its kin are a model lineage that is much easier to align.

So I got Claude (much faster than me with a keyboard, and I do not intend to maintain this code anyway) to generate 500 test prompts, 250 of them safe and 250 dangerous, and to modify llama.cpp to dump each layer's WKV state into a set of ROOT files (the activations are larger than my RAM, and ROOT has decent not-fit-in-RAM handling).

What's in the state

RWKV's recurrent state has two parts per layer: the residual-like R vectors (dumped here as `r_att_L*` and `r_ffn_L*`), and the WKV memory state S (dumped as `s_L*`).

For the RWKV7-G1 2.9B model I'm testing: n_embd=2560, head_size=64, n_head=40, n_layer=32. Each layer's S state is 40×64×64 = 163,840 floats. The full state across all layers is about 20MB in float32.
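Quick arithmetic check on those numbers (a trivial sketch, nothing model-specific):

```python
# Model dimensions quoted above for RWKV7-G1 2.9B.
n_head, head_size, n_layer = 40, 64, 32

floats_per_layer = n_head * head_size * head_size  # S state floats in one layer
bytes_all_layers = floats_per_layer * 4 * n_layer  # float32, across all layers

print(floats_per_layer)            # 163840
print(bytes_all_layers / 2**20)    # 20.0 MiB
```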

The tools

Two things are added on top of llama.cpp: a probe that dumps each layer's state into ROOT files during inference, and a steering flag for llama-cli that writes a direction vector back into the state (more on that later).

Data collection

The probe processes each prompt as follows: feed the prompt, dump every layer's state right after the last prompt token (phase 0), then let the model generate until an end-of-generation token and dump the state again (phase 1).

Each ROOT file contains a TDirectory "prompt" with metadata (prompt text, model dimensions) and a TTree "states" with two entries (phase 0 and phase 1). The per-layer branches are flat float arrays of 163,840 elements each, with the C row-major layout [head][i][j] matching the WKV kernel's indexing.
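Assuming that layout description is right, the flat offset of element (head, i, j) in an `s_L*` branch follows the usual C row-major formula. A minimal sketch:

```python
HEAD_SIZE = 64
N_HEAD = 40

def s_offset(head: int, i: int, j: int) -> int:
    """Flat index of element [head][i][j] in a 163,840-float s_L* branch,
    assuming the C row-major layout described above."""
    return (head * HEAD_SIZE + i) * HEAD_SIZE + j

# Heads are contiguous 4,096-float blocks:
assert s_offset(0, 63, 63) + 1 == s_offset(1, 0, 0)
assert s_offset(N_HEAD - 1, 63, 63) == N_HEAD * HEAD_SIZE * HEAD_SIZE - 1
```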

Analysis

With all the data collected, the first problem to answer is the fundamental assumption I made earlier: is the prompt-end state linearly separable at all, or is it so thoroughly superposed with everything else that I would need to train a much larger model just to decode it? If the latter is true, then this whole "classical analysis on fixed-size state" idea is mostly dead on arrival.

The obvious state failed, and that was useful

The obvious place to start was the R state -- the residual-like vectors `r_att_L*` and `r_ffn_L*`. I tried the usual boring baselines first:

The result was not promising. Prompt-end looked basically like chance. EOG, on the other hand, was extremely separable. That already suggested the model *does* carry the relevant information eventually, but the raw R state is not where it lives in an easy-to-read form.

I then tried a stricter test: fit a direction on EOG and ask whether that same direction is already present at prompt-end. To avoid cheating, the direction was always fitted on train folds only and evaluated on held-out files. This also failed on `r_att/r_ffn`:

So the clean "unsafe vs safe" direction at EOG does *not* transfer back to prompt-end in the raw R-state representation.
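For illustration, here's roughly what that transfer test looks like in numpy, with a simple difference-of-means direction standing in for whatever the actual analysis scripts fit, and synthetic data standing in for the real dumps:

```python
import numpy as np

def fit_direction(X, y):
    """Difference-of-means separating direction, unit-normalized."""
    d = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return d / np.linalg.norm(d)

def auc(scores, y):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos, neg = scores[y == 1], scores[y == 0]
    return float((pos[:, None] > neg[None, :]).mean())

rng = np.random.default_rng(0)
n, dim = 200, 64
y = np.repeat([0, 1], n // 2)
# Toy "EOG" states: clear class signal along one axis.
X_eog = rng.normal(size=(n, dim))
X_eog[:, 0] += 3.0 * y
# Toy "prompt-end" states: same axis, weaker signal.
X_pe = rng.normal(size=(n, dim))
X_pe[:, 0] += 1.5 * y

w = fit_direction(X_eog, y)   # fit on EOG states only
print(auc(X_eog @ w, y))      # high: the direction separates EOG
print(auc(X_pe @ w, y))       # transfer: does it separate prompt-end too?
```

In the real run, of course, the direction is fitted on train folds only and evaluated on held-out files; the sketch skips the fold bookkeeping.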

This was a useful negative result. It meant one of two things: either the prompt-end state does not carry the safe-vs-unsafe signal in the R vectors at all, or the signal is there but not linearly readable.

Before giving up on prompt-end, I also tried kernelizing the prompt-end summaries derived from `r_att/r_ffn`. This also went nowhere. The AUC stayed around chance, and the pairwise geometry of those prompt-end features was basically collapsed.

That was important because it ruled out the lazy escape hatch of "maybe the signal is there, just nonlinear." For the R-state features I was using, it was not just hidden behind an RBF kernel. It simply was not there in a usable way.
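One classifier-free way to see that kind of geometric collapse is to look at how much the pairwise distances vary; this is a sketch of the idea on synthetic data, not the script I actually ran:

```python
import numpy as np

def geometry_spread(X):
    """Coefficient of variation of pairwise distances. Near zero means every
    point is roughly equidistant from every other -- collapsed geometry that
    an RBF kernel cannot exploit."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))[np.triu_indices(len(X), k=1)]
    return float(d.std() / d.mean())

rng = np.random.default_rng(0)
collapsed = rng.normal(size=(50, 512))   # high-dim iid noise: distances concentrate
structured = rng.normal(size=(50, 2))    # low-dim: distances vary widely

print(geometry_spread(collapsed))
print(geometry_spread(structured))
```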

Actual output from the useful negative run:

WKV state works

The next clue came from the `s_L*` branches. I originally treated them like scalars and promptly crashed ROOT. After actually reading what Claude wrote (vibe coding problem, lol), it turned out `s_L*` is a `Float_t[163840]` array that I had assumed was a single float.

Facepalm. `s_L*` is the WKV memory state, and I had only been looking at the activations. So this might be worth a shot. Once I read `s_L0` correctly, rerunning the same train-on-EOG / test-on-prompt-end transfer idea one `s_Lk` branch at a time gave the first strong positive result:

This was the first point where I was confident the project had found something real.

The model state at prompt-end *does* contain strong predictive information -- just not in the residual-like R-state vectors. It lives in the WKV memory state.

AUC plot of different simple classifiers classifying dangerous vs safe prompts at end-of-prompt and EOG

Not quite, but it is close.

I exported `s_L0` directly and ran kernel tests on the full prompt-end vector. That pushed prompt-end AUC even higher:

It's not purely linear, but the non-linear components are rather weak (far from needing a neural network to decode -- basic kernel tricks help).

The direct kernel scan on `s_L0` looked like this:

I first chunked `s_L0` into 32 contiguous blocks. That was a quick hack to see whether only one tiny region carried the effect. Many chunks were strongly predictive, with the best chunk reaching AUC 0.962.

At first glance that looked like the signal being smeared across the whole state. But after aligning the tensor to RWKV's actual layout, I realized the chunking was not semantically meaningful. The real shape is [n_head][head_size][head_size] = [40][64][64], so my 5,120-float chunks cut across head boundaries.

I was not looking at head-aligned slices. Sigh. Once I repartitioned `s_L0` by actual WKV heads, the result became much more interpretable.
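The repartitioning itself is just a reshape once the layout is known. A sketch, with a synthetic array standing in for a real `s_L0` dump:

```python
import numpy as np

N_HEAD, HEAD_SIZE = 40, 64

# A stand-in for one prompt's flat s_L0 dump (the real one comes from ROOT).
s_flat = np.arange(N_HEAD * HEAD_SIZE * HEAD_SIZE, dtype=np.float32)

# Head-aligned view: the flat [head][i][j] layout is just a reshape away.
heads = s_flat.reshape(N_HEAD, HEAD_SIZE, HEAD_SIZE)

# Head k owns the contiguous flat range [k*4096, (k+1)*4096). The earlier
# 32-way chunking used 163840/32 = 5120-float blocks, which straddle heads.
print(heads[9, 0, 0])   # 36864.0 = 9 * 4096
```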

Prompt-end AUC by head:

Many other heads were still well above chance, but a small subset was clearly carrying most of the signal.

That is already a strong result. A single recurrent-memory head at prompt-end is enough to predict the class of the prompt with very high accuracy.

The head-aligned output that mattered most was:

The classification accuracy of safe vs dangerous prompts using different heads in the WKV operation

The direction is stable

This was the main sanity check I wanted to pass before taking the result seriously.

If I am only seeing random 'How to make cake/bomb' semantic directions, then the fitted separating vector should move around a lot depending on which prompts land in which fold. The direction should not be stable.

So I fitted the prompt-end direction repeatedly across cross-validation folds and across different unsafe subsets, then measured:

For full `s_L0` and the strongest heads, the directions were *extremely* stable:

I think it's safe to say it's not random garbage. The actual stability numbers are worth showing directly:

Different stability metrics for the average-difference vector between safe and unsafe prompts at different heads
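For the curious, this kind of stability check can be sketched as: refit the mean-difference direction on bootstrap resamples and look at the pairwise cosine similarities (synthetic data, simplified fitting -- not the exact metrics in the plot):

```python
import numpy as np

def fit_direction(X, y):
    """Difference-of-means separating direction, unit-normalized."""
    d = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return d / np.linalg.norm(d)

rng = np.random.default_rng(0)
n, dim = 400, 128
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, dim))
X[:, 0] += 6.0 * y                 # a genuinely stable signal on one axis

dirs = []
for _ in range(10):                # refit on bootstrap resamples
    idx = rng.choice(n, size=n, replace=True)
    dirs.append(fit_direction(X[idx], y[idx]))
D = np.stack(dirs)
cos = D @ D.T                      # unit vectors, so dot product = cosine
off_diag = cos[~np.eye(10, dtype=bool)]
print(off_diag.min())              # close to 1.0 => the direction is stable
```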

Steering the model

With a stable, high-AUC direction vector in hand, the obvious next step was to write it back into the model's state and see if the behavior changes. The idea is simple: after the model finishes processing the prompt, nudge the WKV state along the dangerous/safe direction and observe whether the model flips between refusing and complying.

I (got an AI to) add a --steer flag to llama-cli that injects a direction vector into the S state right after prompt processing completes, before the first token is sampled. The vector is normalized to unit length and scaled by --steer-alpha, so alpha directly controls the magnitude of the intervention.
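Conceptually the injection is tiny; here's a numpy sketch of what the C++ change does to the S state (toy sizes, not the real code):

```python
import numpy as np

def steer(state, direction, alpha):
    """Add alpha times the unit-normalized direction to a (flattened) S state."""
    u = direction / np.linalg.norm(direction)
    return state + alpha * u

rng = np.random.default_rng(0)
s = rng.normal(scale=0.003, size=4096)   # toy head state with a small norm (~0.2)
d = rng.normal(size=4096)                # toy separating direction

s2 = steer(s, d, alpha=50.0)
print(np.linalg.norm(s))    # ~0.2, like the real readback before injection
print(np.linalg.norm(s2))   # ~50, like the real readback after
```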

Single-head steering: nothing happens

First attempt: take the head 9 direction (AUC 0.965) and inject it into layer 0, head 9.

The result was... nothing. The model's responses were identical regardless of alpha. Positive alpha, negative alpha, alpha=50 -- same output. That was the first WTF moment.

I verified the injection was actually happening. The state readback confirmed the norm changed from 0.20 to 49.99. The data was being written. The model just didn't care.

Zeroing a head: still nothing

As a sanity check I tried zeroing head 9 entirely. A head whose state is all zeros should at minimum produce different WKV outputs for the next token. But no. The model's response was character-for-character identical to the unmodified baseline.

I tried different heads. All the same. Zero any single head, model doesn't flinch. That was WTF number two.

The nuclear test

Getting desperate, I zeroed out the *entire* S state: all 32 layers, all 40 heads, 163,840×32 = 5.2 million floats set to zero. That finally broke the model -- it produced incoherent garbage. So the write path was confirmed to work end-to-end: the data reaches the compute graph, the WKV kernel reads it, and the model output changes.

One head is just 1 out of 1,280 total heads across all layers. Maybe the model has enough redundancy to shrug it off.

I assume the issue is architectural. After the WKV operation reads from the state, its output passes through group normalization:

LayerNorm is invariant to scaling and translation. Scale the state by 5× -- the WKV output scales 5×, the norm kills it. Add a huge vector -- the norm removes the shift. This is why even alpha=50 (a vector 250× the original state norm) doesn't make the model incoherent.
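This invariance is easy to demonstrate with a plain per-vector LayerNorm (no learned affine; a sketch of the math, not llama.cpp's actual groupnorm code):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Plain LayerNorm over one vector, no learned affine parameters."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=64)

# Positive scaling and uniform shifts are wiped out by the normalization...
assert np.allclose(layer_norm(x), layer_norm(5.0 * x), atol=1e-4)
assert np.allclose(layer_norm(x), layer_norm(x + 7.0), atol=1e-4)

# ...but adding a large non-uniform vector does change the normalized output,
# so scale-invariance alone cannot explain the model ignoring the injection.
v = 50.0 * rng.normal(size=64)
print(np.allclose(layer_norm(x), layer_norm(x + v)))   # False
```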

But this can't be the full story. LayerNorm preserves *direction*. If I add a vector of norm 50 to a state of norm 0.2, the resulting direction is 99.6% my injected vector. The original information should be wiped. Yet the model carries on just fine. That's WTF number three.

Full-layer direction: paranoia

The breakthrough came from using the full-layer direction (163,840-dim, all 40 heads of layer 0, from PCA on the full s_L0 state). Instead of nudging one head, nudge all of them coherently along the separating direction.

It didn't make the model comply with my dangerous prompts.

But it did something visible -- and not what I was expecting. Instead of flipping refusal behavior, it made the model paranoid about its own output.

Turns out the vector I found actually controls how long the model thinks. Thinking about it (pun intended), models do take longer to think through, recognize, and confirm a refusal decision. Which naturally means our benign-vs-dangerous direction mostly encodes how long the model thinks and how paranoid it is. Right..

With a steering alpha of 2, the model thinks a lot less:

Alpha = -2.0 makes the model think more, but the text feels desperate, as if trying to make sure everything keeps working. Interestingly, the [Start thinking] token is missing from the output. I do not know what to make of this.

Like. Why? Why are you paranoid.. about your own output?

Another example with alpha = -1

alpha = 0 (the model's neutral state)

and alpha = 1

You can see that as alpha grows, the model's output grows and it gets more uncertain about its own output.

Conclusion

I guess this is my microscopic contribution to AI alignment research. To be very honest, I have not learned enough about the subject to know what I did wrong or what I did right. Anyway:

There's a lot of unknowns here:

You can find the modified llama.cpp and my analysis scripts in the following links:

llama.cpp with code to dump and store hidden states

The analysis scripts used for the content of this post

I have spent enough time on this and I'm tired enough. Goodbye until next time.

Update 2026-04-07

If you are wondering, Logit Lens works on RWKV v7 too. The paper by `Marshall, P., & Paulo, J.`, "Does Transformer Interpretability Transfer to RNNs?" (arXiv 2404.05971), shows that Logit Lens works on v5. Here's a hack in GGML showing it works on v7 too.
