The Hidden Insights of Activation Space: Solving the First BlueDot Puzzle

Insight That Unlocks the Puzzle: Projections Hide Layers of Meaning

Activation spaces hide rich, multidimensional insights, but standard tools like linear probes often miss them. The reason? A single direction in activation space can encode two independent features: one tied to the sign of the projection, the other to its magnitude. Discovering one often blinds you to the other—unless you dig deeper.

In this guide, I’ll show you how a paradox in a seemingly simple classifier reveals the limits of linear analysis and walk you through the mindset and method needed to solve it.

Context: The Puzzle Setup

In BlueDot's first Technical AI Safety puzzle, you’re handed a compact text encoder (all-MiniLM-L6-v2) paired with a simple 5-layer MLP. Its job is to tag short texts with eight binary features, independent properties like “question,” “food,” “country,” or “sentiment,” at over 95% accuracy.

At first glance, the model seems straightforward: the encoder maps text to a 384-dimensional activation vector, and the MLP decodes this into predictions. Yet this simplicity hides a riddle: how does the model handle two features so entangled along a single direction that standard interpretability methods miss one of them entirely?
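To make the shapes concrete, here is a minimal NumPy sketch of the decoding half of that pipeline. It is not the puzzle's actual model: the hidden width, the ReLU activations, the sigmoid outputs, and the random weights are all assumptions, and random vectors stand in for real all-MiniLM-L6-v2 embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 384    # embedding size produced by all-MiniLM-L6-v2
NUM_FEATURES = 8   # binary features the classifier predicts
HIDDEN = 64        # assumed hidden width; the write-up does not state it

# Five linear layers, 384 -> 64 -> 64 -> 64 -> 64 -> 8 (layout is an assumption).
sizes = [EMBED_DIM, HIDDEN, HIDDEN, HIDDEN, HIDDEN, NUM_FEATURES]
weights = [rng.normal(0.0, 0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]


def mlp_forward(embeddings):
    """Map a batch of embeddings to one probability per binary feature."""
    h = embeddings
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = h @ w + b
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)       # ReLU on hidden layers only
    return 1.0 / (1.0 + np.exp(-h))      # independent sigmoid per feature


# Random vectors stand in for encoder outputs so the sketch runs offline.
fake_embeddings = rng.normal(size=(4, EMBED_DIM))
probs = mlp_forward(fake_embeddings)
predictions = probs > 0.5

print(probs.shape)        # (batch, 8)
print(predictions[0])     # eight independent yes/no tags for the first text
```

Because each output gets its own sigmoid, the eight tags are predicted independently, which is what lets two of them share a single input direction without conflicting at the output.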

The Discovery: One Axis, Two Features

Linear probes, a go-to tool in model interpretability, look for directions in activation space along which a feature can be read off. In this puzzle a probe recovered the first feature, but persistent inconsistencies in the predictions suggested that something else was hiding along the same direction.

Here’s the twist: a single direction encodes two features using different mechanisms:

Feature 1: Determined by the sign (positive/negative projection).
Feature 2: Encoded in the magnitude (distance from zero).
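
One way to write the two read-outs down, assuming a unit direction w, an activation x, and a magnitude threshold τ (all three symbols are introduced here for illustration, and which sign counts as "on" is also an assumption):

```latex
p = w^\top x, \qquad
\hat{f}_1 = \mathbb{1}\left[\, p > 0 \,\right], \qquad
\hat{f}_2 = \mathbb{1}\left[\, |p| > \tau \,\right]
```

The first rule is a half-line in p; the second is the union of two half-lines, which no single threshold on p can reproduce.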

A linear probe, focusing solely on sign, correctly decoded the first feature but remained blind to the information encoded in magnitude. In essence, the second feature—a whole dimension of meaning—hid in plain sight.
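
The blind spot is easy to reproduce on synthetic data. The sketch below does not use the puzzle's activations: it assumes projections clustered around ±1 and ±3, with the sign carrying one feature and the magnitude carrying the other, and then searches for the best single threshold a one-dimensional linear probe could use.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Two independent binary features (synthetic stand-ins for the puzzle's labels).
f1 = rng.integers(0, 2, size=n)   # carried by the sign of the projection
f2 = rng.integers(0, 2, size=n)   # carried by the magnitude of the projection

# Assumed magnitudes: about 1 when f2 is off, about 3 when f2 is on.
magnitude = np.where(f2 == 1, 3.0, 1.0) + rng.normal(0.0, 0.1, size=n)
proj = np.where(f1 == 1, 1.0, -1.0) * magnitude


def best_linear_accuracy(values, labels):
    """Best accuracy of any rule `values > t` or `values < t` (a 1-D linear probe)."""
    sorted_vals = np.sort(values)
    # Candidate thresholds: one below everything, plus midpoints between neighbours.
    thresholds = np.concatenate(
        ([sorted_vals[0] - 1.0], (sorted_vals[:-1] + sorted_vals[1:]) / 2)
    )
    best = 0.0
    for t in thresholds:
        acc = float(np.mean((values > t) == labels))
        best = max(best, acc, 1.0 - acc)   # 1 - acc covers the flipped rule
    return best


acc_f1 = best_linear_accuracy(proj, f1)
acc_f2 = best_linear_accuracy(proj, f2)
print(f"best linear threshold, sign feature:      {acc_f1:.3f}")
print(f"best linear threshold, magnitude feature: {acc_f2:.3f}")
```

On this toy data the sign feature is perfectly separable, while the best threshold for the magnitude feature still misclassifies one of the two high-magnitude clusters, so its accuracy stays well short of perfect.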

The Solution: Second-Order Boundaries

To crack this riddle, I used this mental framework:

1. Shift from "What’s Missing?" to "How Else Could It Be There?"

The failure of a linear probe forced me to ask: what mechanisms could the model itself be relying on that I had overlooked? Models don’t necessarily encode features in the ways humans expect.

2. Test Hypotheses with Chains of Validation

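Hypothesis: If feature 2 (e.g., “country”) is encoded in magnitude, its predictions should depend on the absolute size of the projection onto the shared direction, regardless of sign. To test this, I decomposed activations into their component along that direction and the remainder, then traced how predictions changed as I varied the input.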
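
A test of this kind can be sketched as follows. Everything here is illustrative: the direction is random rather than taken from a fitted probe, and `toy_feature2_head` is a hypothetical stand-in for the real classifier, hard-coded to fire on large magnitudes so the expected pattern is visible.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 384  # embedding size stated in the puzzle setup

# Hypothetical probe direction; in practice this would be a fitted probe's
# weight vector, normalised to unit length.
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)


def decompose(activation, unit_dir):
    """Split an activation into its scalar projection on unit_dir and the orthogonal residual."""
    proj = float(activation @ unit_dir)
    residual = activation - proj * unit_dir
    return proj, residual


def toy_feature2_head(activation, unit_dir, radius=2.25):
    """Stand-in for the real model: fires when |projection| exceeds an assumed radius."""
    return int(abs(activation @ unit_dir) > radius)


activation = rng.normal(size=dim)
proj, residual = decompose(activation, direction)

# Sweep the projection while holding the residual fixed, recording each prediction.
sweep = np.linspace(-4.0, 4.0, 17)
preds = [toy_feature2_head(residual + a * direction, direction) for a in sweep]

for a, p in zip(sweep, preds):
    print(f"projection={a:+.1f}  feature2={p}")
```

If the magnitude hypothesis holds for a real model, the sweep should look symmetric: the prediction switches on at both ends of the range and off near zero. A sign-coded feature would instead flip once, at a single point along the sweep.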

3. Move to Nonlinearity with Second-Order Methods

Linear probes slice activation space with flat hyperplanes. To extract the second feature, I tried nonlinear decision boundaries on the magnitude of the projection (e.g., radius thresholds), adjusting them iteratively. Validation confirmed that these boundaries aligned perfectly with the model's predictions for the second feature.
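
A radius threshold of this kind can be fitted with a simple scan. The sketch below is an illustration on synthetic projections (clusters near ±1 and ±3, an assumption), not the notebook's code; `fit_radius_threshold` is a hypothetical helper name.

```python
import numpy as np


def fit_radius_threshold(proj, labels):
    """Return (threshold, accuracy) for the best rule `abs(proj) > threshold`."""
    radii = np.abs(proj)
    sorted_r = np.sort(radii)
    candidates = (sorted_r[:-1] + sorted_r[1:]) / 2   # midpoints between neighbours
    best_t, best_acc = float(candidates[0]), 0.0
    for t in candidates:
        acc = float(np.mean((radii > t) == labels))
        if acc > best_acc:
            best_t, best_acc = float(t), acc
    return best_t, best_acc


rng = np.random.default_rng(2)
n = 1000

# Synthetic data: the sign carries one feature, the magnitude carries the other.
sign_feature = rng.integers(0, 2, size=n)
magnitude_feature = rng.integers(0, 2, size=n)
magnitude = np.where(magnitude_feature == 1, 3.0, 1.0) + rng.normal(0.0, 0.1, size=n)
proj = np.where(sign_feature == 1, 1.0, -1.0) * magnitude

# Fit on the first 700 points, validate on the remaining 300.
threshold, train_acc = fit_radius_threshold(proj[:700], magnitude_feature[:700])
val_pred = np.abs(proj[700:]) > threshold
val_acc = float(np.mean(val_pred == magnitude_feature[700:]))

print(f"radius threshold: {threshold:.2f}")
print(f"train accuracy:   {train_acc:.3f}")
print(f"held-out accuracy: {val_acc:.3f}")
```

Taking the absolute value folds the two high-magnitude clusters onto each other, so a single threshold separates them from the low-magnitude ones. Thresholding the squared projection would be equivalent.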

Result: The model’s choice to “pack” two features onto one axis was both efficient and sly, revealing the limits of first-order tools.

Practical Implications: Rethinking Model Debugging

Linear Tools Have Hidden Blind Spots: Linear probes can miss subtle, encoded features such as magnitude-dependent information.
Probe Beyond First-Order Analyses: Always layer nonlinear methods when patterns seem incomplete or inconsistent. Second-order boundaries or clustering methods often reveal what’s hidden.
Adopt the “Iterative Why” Mindset: Treat model interpretability as a recursive game. Each layer of understanding invites deeper questions: Why here? Why this structure? What’s next?

This isn’t just about cracking one AI puzzle—this is a mindset shift for any system where complex patterns go unnoticed.

Sources & Further Reading

Full notebook/code: GitHub Repository
LessWrong article detailing the puzzle: One Axis and Two Features

Dive deeper, and always keep asking: What else could this be?
