Preprint version · SAE feature naming · Token-aligned dictionaries

VASAE: Vocabulary-Aligned Sparse Autoencoders

Kairui Zhang · VASAE project

Cite VASAE when discussing training-time SAE feature naming via vocabulary-aligned dictionary directions. The method keeps SAE dictionary directions learnable, anchors them to fixed token embeddings, and names aligned features by nearest-token lookup.

Why VASAE?

SAEs learn useful directions, but naming those directions is usually a separate step.

Standard sparse autoencoders decompose residual-stream activations into sparse feature directions. Those directions are usually named after training by inspecting top-activating contexts or using automated explanation tools.

VASAE turns feature naming into a reconstruction-preserving geometric alignment problem: keep the dictionary learnable, but pull feature directions toward fixed token-embedding anchors.

Scope: an intrinsic token name is the nearest vocabulary anchor for a learned dictionary direction. It is not a full semantic explanation or a causal claim.

Method in one picture

Keep the SAE useful for reconstruction, then attach vocabulary names geometrically.

Standard SAE

  1. Residual stream
  2. Sparse code
  3. Learned dictionary direction
  4. Post-hoc feature name

VASAE

  1. Residual stream
  2. Sparse code
  3. Learned dictionary direction
  4. Nearest token embedding
  5. Intrinsic token name

The decoder remains learnable. Token embeddings act as fixed vocabulary anchors, not as frozen decoder features. This matters because the hard-tied baseline, whose decoder is frozen to the token embeddings, loses reconstruction quality.

What VASAE gives you

VASAE gives a concrete training-time alternative to post-hoc SAE feature naming.

Scope

Training-time SAE feature naming.

We study whether learned SAE dictionary directions can receive intrinsic nearest-token names during training rather than only after inspection.

Mechanism

Soft vocabulary anchoring.

The decoder remains learnable. A soft anchor objective pulls dictionary directions toward fixed token embeddings without freezing the decoder to the vocabulary matrix.
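
The soft anchor idea can be sketched as a penalty added to the reconstruction loss. This is a minimal illustration, not the paper's exact objective: the penalty form (one minus the best cosine similarity to any anchor, averaged over features), the function names, and the toy vectors are all assumptions. The default weight `lam=5e-3` borrows the lambda value quoted on this page for Llama; whether it plays exactly this role in the paper is also an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def anchor_penalty(dictionary, anchors):
    """Mean over dictionary directions of (1 - best cosine to any anchor).

    Assumed penalty form: zero when every direction is parallel to some anchor.
    """
    total = 0.0
    for direction in dictionary:
        best = max(cosine(direction, anchor) for anchor in anchors)
        total += 1.0 - best
    return total / len(dictionary)

def soft_anchored_loss(recon_error, dictionary, anchors, lam=5e-3):
    """Reconstruction error plus a weighted anchor penalty (hypothetical form).

    In real training only `dictionary` would receive gradients;
    `anchors` (the token embeddings) stay fixed.
    """
    return recon_error + lam * anchor_penalty(dictionary, anchors)

# Toy 2-dimensional example with two fixed anchors (illustrative values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 2.0]]   # parallel to the anchors
drifted = [[1.0, 1.0], [1.0, -1.0]]  # 45 degrees away from the anchors

loss_aligned = soft_anchored_loss(0.1, aligned, anchors)
loss_drifted = soft_anchored_loss(0.1, drifted, anchors)
```

With equal reconstruction error, the drifted dictionary pays a small extra cost, so the objective nudges directions toward anchors without forcing the decoder to equal the vocabulary matrix.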

Evidence

Preserved reconstruction, bounded alignment.

VASAE-Soft preserves reconstruction in the reported runs. GPT-2 alignment is strong through most layers; Llama alignment is stronger in shallow layers than final layers.

  • 0.965: GPT-2 VASAE-Soft variance explained
  • 89-94%: GPT-2 L0-L10 features with s_i >= 0.8
  • 92.8%: Llama L0 alignment at lambda=5e-3
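
For readers who want to compute comparable summary numbers on their own SAE, the sketch below shows one common way to do it. It rests on two assumptions: variance explained is taken as one minus the ratio of residual to total sum of squares (the paper may average differently), and s_i is treated as a per-feature alignment score such as the best cosine similarity to any token embedding. The function names and toy data are hypothetical.

```python
def variance_explained(activations, reconstructions):
    """1 - (residual sum of squares / total sum of squares about the mean)."""
    n = len(activations)
    dim = len(activations[0])
    mean = [sum(x[j] for x in activations) / n for j in range(dim)]
    ss_res = sum(
        (x[j] - r[j]) ** 2
        for x, r in zip(activations, reconstructions)
        for j in range(dim)
    )
    ss_tot = sum((x[j] - mean[j]) ** 2 for x in activations for j in range(dim))
    return 1.0 - ss_res / ss_tot

def fraction_aligned(scores, threshold=0.8):
    """Share of features whose alignment score s_i meets the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

# Toy data (illustrative values only, not results from the paper).
acts = [[1.0, 2.0], [3.0, 4.0]]
recon = [[1.0, 2.0], [3.0, 3.0]]
scores = [0.95, 0.85, 0.8, 0.4]

ve = variance_explained(acts, recon)  # 0.75 for this toy data
frac = fraction_aligned(scores)       # 0.75 for this toy data
```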

Figures from the paper

Visual anchors for the VASAE claim.

The figure set keeps each plot to one takeaway so the page supports navigation instead of repeating the full paper.

[Figure: GPT-2 feature-token alignment distribution]

Alignment distribution

GPT-2 VASAE-Soft shifts many dictionary directions above the strong token-alignment threshold.

[Figure: Llama layer-wise alignment boxplots at lambda 5e-3]

Layer boundary

Llama-3.1-8B shows strong shallow-layer alignment but unstable final-layer alignment.

[Figure: GPT-2 place street feature-token case study]

Feature examples

Case-study heatmaps let readers inspect nearest-token names in context.

How to describe VASAE

A short description researchers can reuse.

One-sentence description

VASAE trains sparse autoencoder dictionary directions with a soft vocabulary-anchor objective, producing nearest-token names for many learned features while preserving reconstruction quality.

Cite VASAE when discussing

  • SAE feature naming.
  • Vocabulary-aligned dictionary learning.
  • Training-time alternatives to post-hoc feature interpretation.
  • Geometric interfaces between residual-stream features and token embeddings.

Claim boundary

An intrinsic token name is a geometric anchor.

What it means

The intrinsic token name is the token whose embedding is nearest to a learned SAE dictionary direction. It is a vocabulary-level geometric label.
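
A nearest-token lookup of this kind can be sketched in a few lines. The page only says "nearest", so using cosine similarity as the distance measure is an assumption, as are the function names and the toy three-dimensional vocabulary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_token(direction, token_embeddings):
    """Return (token, score) for the embedding most similar to `direction`.

    `token_embeddings` maps token strings to fixed embedding vectors.
    Cosine similarity is an assumed choice of "nearest".
    """
    return max(
        ((tok, cosine(direction, emb)) for tok, emb in token_embeddings.items()),
        key=lambda pair: pair[1],
    )

# Toy vocabulary (hypothetical values, for illustration only).
vocab = {
    " street": [1.0, 0.0, 0.0],
    " river": [0.0, 1.0, 0.0],
    " bank": [0.0, 0.6, 0.8],
}
feature_direction = [0.9, 0.1, 0.0]

name, score = nearest_token(feature_direction, vocab)
```

Here `name` is the intrinsic token name and `score` is the similarity that a threshold such as 0.8 could be applied to.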

What it does not mean

  • Not a complete semantic explanation of the feature.
  • Not evidence that the feature causally controls the named token.
  • Not a guarantee that every activating context uses the named concept.
  • Not a one-to-one mapping; multiple features can share the same token name.
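
The last point is easy to check on any set of named features. The sketch below groups feature indices by token name and lists the names carried by more than one feature; the helper name and the example mapping are hypothetical.

```python
from collections import defaultdict

def group_by_token_name(feature_names):
    """Map each token name to the list of feature ids that carry it."""
    groups = defaultdict(list)
    for feature_id, token in feature_names.items():
        groups[token].append(feature_id)
    return dict(groups)

# Hypothetical feature-id -> token-name assignments.
names = {0: " street", 1: " road", 2: " street", 3: " place"}

groups = group_by_token_name(names)
# Token names shared by more than one feature.
shared = {tok: ids for tok, ids in groups.items() if len(ids) > 1}
```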

Citation

Cite VASAE.

If VASAE helps you discuss SAE feature naming, vocabulary-aligned dictionary learning, or alternatives to post-hoc interpretation, please cite the preprint version below.

@misc{vasae2025,
  title  = {VASAE: Vocabulary-Aligned Sparse Autoencoders},
  author = {VASAE authors},
  year   = {2025},
  url    = {https://github.com/karry-z/VASAE}
}

Citation metadata will be updated after the formal version is available.