Watching a Language Model Think in Real Time

I built a tool that runs a local LLM while simultaneously showing you what concepts the model is activating on each token. Here's why, how, and what's next.

Two days ago I had never heard of mechanistic interpretability. I vaguely knew that people were trying to understand what happens inside neural networks, but I couldn't have told you what a Sparse Autoencoder was, what a residual stream was, or why anyone would care about either.

Then I fell down the rabbit hole. I read Anthropic's Scaling Monosemanticity paper, then their Circuit Tracing work, then Google's Gemma Scope release. The basic idea clicked fast: a language model's internal state at any given layer is a dense vector of thousands of floating-point numbers that doesn't mean anything to a human. But you can train a Sparse Autoencoder (SAE) to decompose that dense vector into a sparse set of interpretable "features" — individual concepts the model has learned. A feature might correspond to "geography or place names" or "greeting or salutation." When the model generates the token "Paris," geography features light up. When it writes "Hello!", greeting features activate instead.
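
In code terms, the encode step is a single affine map followed by a thresholded activation. Here is a toy-sized sketch in plain Rust (made-up dimensions and weights; Gemma Scope's real encoders are [2304 → 16384] with learned per-feature JumpReLU thresholds):

```rust
// Toy sketch of an SAE encoder: dense residual-stream vector in,
// sparse feature activations out. Everything is toy-sized here.

/// JumpReLU: pass the pre-activation through only if it clears a
/// per-feature learned threshold, otherwise output zero.
fn jump_relu(pre: f32, threshold: f32) -> f32 {
    if pre > threshold { pre } else { 0.0 }
}

/// z_i = JumpReLU(w_enc_i · x + b_i, theta_i)
fn sae_encode(x: &[f32], w_enc: &[Vec<f32>], b: &[f32], theta: &[f32]) -> Vec<f32> {
    w_enc
        .iter()
        .zip(b.iter().zip(theta.iter()))
        .map(|(row, (bias, th))| {
            let pre: f32 = row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>() + bias;
            jump_relu(pre, *th)
        })
        .collect()
}

fn main() {
    // 3-dim "residual stream", 4 features.
    let x = vec![1.0, 0.5, -0.25];
    let w_enc = vec![
        vec![2.0, 0.0, 0.0], // feature 0: fires strongly
        vec![0.0, 1.0, 0.0], // feature 1: below threshold
        vec![0.0, 0.0, 4.0], // feature 2: negative pre-activation
        vec![1.0, 1.0, 0.0], // feature 3: fires
    ];
    let b = vec![0.0; 4];
    let theta = vec![0.6; 4];
    let z = sae_encode(&x, &w_enc, &b, &theta);
    // Most entries are zero: the representation is sparse.
    assert_eq!(z, vec![2.0, 0.0, 0.0, 1.5]);
}
```

The sparsity is the whole point: out of 16,384 features, only a handful clear their thresholds on any given token, and those few are the "concepts" the tool surfaces.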

What surprised me was that Google had already done the hard part. They released over 400 pre-trained SAEs covering every layer of their open Gemma 2 models. But there was no easy way to actually use them during inference. You could download the weights, but then what? You'd need to wire them into an inference engine, hook into the model's forward pass at the right layer, apply the correct normalization, figure out JumpReLU thresholds, and somehow label 16,384 latent features with human-readable descriptions.

I learn by building things, so I built Neuroscope. I'm very new to this field and probably getting some things wrong — but it works, and building it taught me more about transformer internals in two days than I'd learned in the previous year of using LLMs.

What It Does

Neuroscope runs Gemma 2 2B locally using mistral.rs for inference, hooks into layer 20 of the transformer, runs the residual stream through a Gemma Scope SAE encoder, and streams the top activated features in real time via SSE. It exposes two APIs — a standard OpenAI-compatible chat endpoint on port 8080, and a features stream on port 8081:

 Chat API (:8080)                Features API (:8081)
 POST /v1/chat/completions       GET /v1/features/stream → SSE
 GET  /v1/models                 GET /v1/features/labels → JSON
        │                                ▲
        ▼                                │
 ┌───────────────────────────────────────┤
 │  Inference Engine (mistral.rs)        │
 │                                       │
 │  Transformer layer 20 ──hook──► SAE Encoder
 │                                       │
 │  Token output ──────────────► broadcast channel
 └───────────────────────────────────────┘

You send a chat message, and while the model streams back its response, the features endpoint streams what the model is "thinking about" for each token:

{
  "token_index": 0,
  "token": "Paris",
  "layer": 20,
  "top_features": [
    {"index": 4521, "label": "geography or place names", "activation": 3.82},
    {"index": 12033, "label": "European countries and capitals", "activation": 2.14}
  ]
}

Because the chat API is OpenAI-compatible, you can point any existing tool at it — Continue, Open WebUI, plain curl — and it works as a normal LLM endpoint. The features stream is a separate concern that you consume independently. A terminal logger, a web visualization, both at once — whatever you want.

Why I Built It

I learn best by building. Reading papers gives me the concepts, but I don't really understand something until I've written code that makes it work. When I saw that Google had released hundreds of pre-trained SAEs but the only way to use them was through research notebooks and custom Python scripts, it felt like the perfect project — a clear gap between "the hard science is done" and "anyone can actually use this."

The hardest part — training the SAE — was already done. The second hardest part — running inference — has excellent open-source solutions like mistral.rs. What was missing was the glue: hooking the two together, handling the normalization, filtering noise, labeling features, and wrapping it all in something you can just run.

I also wanted something that ran on a laptop. Gemma 2 2B is small enough to run on a MacBook Air with a Metal GPU. The SAE encoder is a single matrix multiply — [2304 × 16384], about 75 million FLOPs — which is trivial compared to the model's forward pass. The whole thing adds almost no overhead.
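
The FLOPs figure is easy to verify: counting a multiply and an add as separate floating-point operations, one encode pass is 2 × 2304 × 16384 operations.

```rust
fn main() {
    // One SAE encode per token is a [1 x 2304] x [2304 x 16384] matmul.
    let d_model: u64 = 2304;
    let n_features: u64 = 16384;
    // Counting multiply and add as separate FLOPs:
    let flops = 2 * d_model * n_features;
    assert_eq!(flops, 75_497_472); // ~75 million FLOPs per token
}
```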

The Interesting Parts

Hooking Into the Forward Pass

The trickiest architectural decision was how to intercept activations mid-inference. mistral.rs doesn't natively support activation hooks, so I vendored it and patched the Gemma 2 model to invoke an ActivationHook trait object after each transformer layer's forward pass. The hook is an Arc<dyn ActivationHook> stored on the model — one line of code in the forward loop, a vtable dispatch per layer, negligible overhead.
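
For illustration, a hook point like that might look as follows. The trait name matches the post, but the signature and surrounding types are simplified guesses: plain Vec<f32> instead of candle tensors, and a stand-in for the real layer math.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Hypothetical sketch of the patched hook point; the real trait lives
// in the vendored mistral.rs and operates on candle Tensors.
trait ActivationHook: Send + Sync {
    fn on_layer_output(&self, layer: usize, hidden: &[f32]);
}

struct Model {
    hook: Option<Arc<dyn ActivationHook>>,
}

impl Model {
    fn forward(&self, mut hidden: Vec<f32>) -> Vec<f32> {
        // Gemma 2 2B has 26 transformer layers.
        for layer in 0..26 {
            // ... real transformer layer math would go here ...
            hidden.iter_mut().for_each(|h| *h += 1.0); // stand-in
            // The one-line patch: a vtable dispatch per layer.
            if let Some(hook) = &self.hook {
                hook.on_layer_output(layer, &hidden);
            }
        }
        hidden
    }
}

// A trivial hook that just counts how many times it fired.
struct CountingHook(AtomicUsize);
impl ActivationHook for CountingHook {
    fn on_layer_output(&self, _layer: usize, _hidden: &[f32]) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

fn main() {
    let hook = Arc::new(CountingHook(AtomicUsize::new(0)));
    let hook_dyn: Arc<dyn ActivationHook> = hook.clone();
    let model = Model { hook: Some(hook_dyn) };
    model.forward(vec![0.0; 4]);
    assert_eq!(hook.0.load(Ordering::SeqCst), 26); // once per layer
}
```

An SAE-specific hook then only needs to check for its target layer index and ignore the rest.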

The neuroscope engine then implements SaeHook, which catches layer 20, grabs the hidden state tensor, runs it through the SAE encoder, and publishes the top-K features to a Tokio broadcast channel. The SSE server subscribes to that channel. Clean separation: inference doesn't know about HTTP, the web server doesn't know about tensors.

Device Alignment on Metal

This one cost me a few hours. On macOS Metal, Candle uses pointer equality to compare Device instances. If you construct two Device::Metal values that both point at the same physical GPU, Candle treats them as different devices and matmul panics. The fix is simple but non-obvious: you have to extract the Device from the loaded model pipeline and load SAE weights onto that exact same instance. Not a different instance of the same device — the same Rust value.
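
The same footgun can be reproduced with plain Arc values, which is a reasonable mental model for how the identity check behaves: two handles that compare equal by value are still different instances.

```rust
use std::sync::Arc;

fn main() {
    // Two Arcs wrapping equal values, allocated separately:
    let gpu_a: Arc<String> = Arc::new("metal:0".to_string());
    let gpu_b: Arc<String> = Arc::new("metal:0".to_string());

    assert_eq!(gpu_a, gpu_b);              // equal by value...
    assert!(!Arc::ptr_eq(&gpu_a, &gpu_b)); // ...but different instances

    // Cloning the handle extracted from the model pipeline is the
    // moral equivalent of the fix: same underlying instance.
    let gpu_c = Arc::clone(&gpu_a);
    assert!(Arc::ptr_eq(&gpu_a, &gpu_c));
}
```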

Filtering Noisy Features

Raw SAE output is noisy. Some features fire on 80-90% of all tokens — they're effectively always on and represent something generic like "the model is generating text" rather than anything token-specific. These dominate the top-K list and drown out the interesting stuff.

Neuroscope runs a calibration pass over 1,000 WikiText samples to compute per-feature statistics (firing rate, mean activation, variance), then applies a two-stage filter:

  1. Frequency filter: exclude features that fire on more than 50% of tokens
  2. Surprise ranking: rank remaining features by how unexpected their activation is (z-score relative to calibration statistics)

A feature that always fires at activation 50 but suddenly spikes to 200 is interesting. A feature that always fires at 50 is not. This combined approach — remove the background, then rank by surprise — produces much cleaner output than raw top-K.
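
A sketch of that filter, with made-up struct fields and thresholds standing in for Neuroscope's actual calibration types:

```rust
// Illustrative types, not Neuroscope's real ones.
struct FeatureStats {
    firing_rate: f32, // fraction of calibration tokens where it fired
    mean: f32,        // mean activation when firing
    std: f32,         // std-dev of activation when firing
}

/// Stage 1: drop always-on features. Stage 2: rank survivors by the
/// z-score of the current activation against calibration statistics.
fn rank_by_surprise(
    activations: &[(usize, f32)], // (feature index, activation)
    stats: &[FeatureStats],
    max_firing_rate: f32,
) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = activations
        .iter()
        .filter(|(i, _)| stats[*i].firing_rate <= max_firing_rate)
        .map(|(i, a)| {
            let s = &stats[*i];
            let z = (a - s.mean) / s.std.max(1e-6);
            (*i, z)
        })
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored
}

fn main() {
    let stats = vec![
        FeatureStats { firing_rate: 0.9, mean: 50.0, std: 5.0 }, // always-on
        FeatureStats { firing_rate: 0.1, mean: 50.0, std: 5.0 }, // spiking
        FeatureStats { firing_rate: 0.2, mean: 10.0, std: 2.0 },
    ];
    let acts = vec![(0, 55.0), (1, 200.0), (2, 11.0)];
    let ranked = rank_by_surprise(&acts, &stats, 0.5);
    // Feature 0 is filtered out; feature 1's spike to 200 ranks first.
    assert_eq!(ranked[0].0, 1);
    assert_eq!(ranked.len(), 2);
}
```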

Labeling 16,384 Features

An SAE with 16K features isn't very useful if feature #4521 is just "feature_4521." You need human-readable labels. Neuroscope generates these using auto-interp — an approach from Bills et al. that was later refined by Anthropic and EleutherAI.

The pipeline:

  1. Corpus pass (~75 min): Run 5,000 WikiText samples through the model with the SAE hook active. For each of the 16,384 features, collect the top 20 tokens where that feature activated most strongly, along with surrounding context.

  2. Label generation (~30 min): Send each feature's max-activating examples to an LLM (DeepSeek V3.2 via OpenRouter by default, but you can swap in Claude or GPT-4o) and ask it to describe what pattern the examples share. Each label is cached as it completes, and the whole process checkpoints every 50 samples so it resumes on interrupt.

  3. Scoring (optional): Use EleutherAI's detection method — show the labeler model a mix of activating and non-activating text, ask it to predict which is which using only the label, and compute balanced accuracy. Labels below 60% get flagged as unreliable.
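
The balanced-accuracy computation in step 3 is small enough to spell out. This sketch assumes the labeler's predictions and the ground truth arrive as boolean slices, which is my simplification rather than the pipeline's actual representation:

```rust
/// Balanced accuracy: mean of true-positive rate and true-negative
/// rate, so a labeler that always guesses "activating" scores 0.5.
fn balanced_accuracy(predicted: &[bool], actual: &[bool]) -> f32 {
    let (mut tp, mut tn, mut pos, mut neg) = (0f32, 0f32, 0f32, 0f32);
    for (&p, &a) in predicted.iter().zip(actual) {
        if a {
            pos += 1.0;
            if p { tp += 1.0; }
        } else {
            neg += 1.0;
            if !p { tn += 1.0; }
        }
    }
    (tp / pos + tn / neg) / 2.0
}

fn main() {
    // 4 activating, 4 non-activating examples; the labeler gets all
    // positives and half the negatives right.
    let actual =    [true, true, true, true, false, false, false, false];
    let predicted = [true, true, true, true, true,  true,  false, false];
    let ba = balanced_accuracy(&predicted, &actual);
    // (1.0 + 0.5) / 2 = 0.75; above the 0.60 cutoff, so not flagged.
    assert!((ba - 0.75).abs() < 1e-6);
}
```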

The label pipeline supports multiple labelers with namespacing, so you can generate labels with DeepSeek and Claude side-by-side and compare quality.

If you don't want to spend 2 hours generating your own labels, neuroscope pull downloads pre-computed labels and calibration data from HuggingFace in about a minute.

What I Learned

Start with the spec. I wrote a detailed spec before writing any code — architecture diagrams, trait definitions, test plans, even the SSE event schemas. This paid off massively. When I hit the Metal device alignment issue, I knew exactly where in the architecture the fix belonged. When I needed to add calibration filtering, the spec already defined how it would plug into the existing hook pipeline.

Vendor and patch. I initially tried to avoid modifying mistral.rs, but there's no way to intercept activations mid-forward-pass from outside. Vendoring the dependency and adding a 5-line hook trait was far simpler than building a custom inference engine or trying to make it work through the existing API. The changes are minimal and could be upstreamed.

SAE normalization matters a lot. Gemma Scope SAEs were trained on RMS-normalized inputs. Feeding raw hidden states produces wildly inflated activations that don't match the training distribution. One line of normalization — divide by root mean square — makes everything work correctly. This is documented in Google's paper but easy to miss.
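
The normalization itself is tiny. A sketch, with an epsilon I picked arbitrarily rather than whatever value Google's reference code uses:

```rust
/// Divide a hidden-state vector by its root mean square, matching
/// the normalization Gemma Scope SAEs saw during training.
/// The epsilon is my guess, not Google's constant.
fn rms_normalize(x: &[f32]) -> Vec<f32> {
    let ms: f32 = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let rms = (ms + 1e-6).sqrt();
    x.iter().map(|v| v / rms).collect()
}

fn main() {
    let raw = vec![3.0, 4.0, 0.0, 0.0];
    // mean square = (9 + 16) / 4 = 6.25, so rms = 2.5
    let normed = rms_normalize(&raw);
    assert!((normed[0] - 1.2).abs() < 1e-3);
    assert!((normed[1] - 1.6).abs() < 1e-3);
}
```

Skip this step and the SAE still produces numbers, just the wrong ones, which is exactly the kind of silent failure that makes it easy to miss.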

The Architecture

The project is about 5,000 lines of Rust across four crates:

 Crate               Purpose
 neuroscope-core     SAE encoder, feature types, labels, calibration, filtering, auto-interp
 neuroscope-engine   mistral.rs wrapper, device alignment, hook wiring
 neuroscope-server   Axum HTTP servers for chat API + features SSE
 neuroscope-cli      CLI orchestration — 15+ subcommands

The split is intentional. neuroscope-core is a standalone interpretability library with no inference engine dependency. neuroscope-engine encapsulates mistral.rs (you could swap it for vLLM or llama.cpp). neuroscope-server is pure HTTP logic testable without loading a 5 GB model. Unit tests run in milliseconds; integration tests that need real model weights are gated behind an env var.

What's Next

Right now Neuroscope observes a single layer. I'm still learning, but the roadmap has four phases that I think make sense (corrections welcome):

  1. Multi-layer observation: Instrument several layers simultaneously and show how features evolve as information flows through the network. This is mostly plumbing — the hook infrastructure already supports it.

  2. Feature steering: Load the SAE decoder weights and use them to clamp or modify feature activations mid-inference. Want to amplify the "formal tone" feature and suppress "casual language"? Inject the decoder vector back into the residual stream. This turns observation into intervention.

  3. Causal tracing: Use weight-space analysis to compute causal influence between features across layers. The encoder of one layer dotted with the decoder of another gives you a complete causal map — which features in layer 18 excite or inhibit which features in layer 20. One matrix multiply, no forward passes needed.

  4. Activation patching: The full circuit analysis — run the model twice (clean and corrupted), patch specific feature activations between runs, measure the effect on output. This is the gold standard for mechanistic interpretability but requires the most machinery.
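
To make the weight-space claim in step 3 concrete: under the simplifying assumption that a feature's decoder write passes through the residual stream unchanged, the influence of upstream feature j on downstream feature i is the dot product of i's encoder row with j's decoder direction. A toy sketch:

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// influence[i][j] = (layer-20 encoder row i) · (layer-18 decoder
/// direction j). Decoder directions are stored per-feature here.
/// This ignores everything between the layers, so it is a
/// first-order approximation, not a full causal analysis.
fn causal_map(enc_20: &[Vec<f32>], dec_18: &[Vec<f32>]) -> Vec<Vec<f32>> {
    enc_20
        .iter()
        .map(|e| dec_18.iter().map(|d| dot(e, d)).collect())
        .collect()
}

fn main() {
    // Layer-18 decoder: the residual-stream direction each upstream
    // feature writes when it fires. Toy d_model = 3.
    let dec_18 = vec![
        vec![1.0, 0.0, 0.0], // upstream feature 0
        vec![0.0, 1.0, 0.0], // upstream feature 1
    ];
    // Layer-20 encoder: the direction each downstream feature reads.
    let enc_20 = vec![
        vec![2.0, 0.0, 0.0],  // excited by upstream feature 0
        vec![0.0, -1.0, 0.0], // inhibited by upstream feature 1
    ];
    let m = causal_map(&enc_20, &dec_18);
    assert_eq!(m[0][0], 2.0);  // excitation
    assert_eq!(m[1][1], -1.0); // inhibition
}
```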

I'm sure my understanding of some of this will evolve as I dig deeper. But the general direction feels right: go from "watch the model think" to "understand why it thinks what it thinks" — and eventually, to intervene in a targeted way.

Try It

# Build (macOS)
cargo build --release -p neuroscope-cli --features metal

# Pull pre-computed labels and calibration (~1 min)
neuroscope pull

# Run
neuroscope serve

Then run curl -N http://localhost:8081/v1/features/stream in one terminal and send a chat request in another. Source is on GitHub, MIT licensed.