2026-01-31 · AI & Agents

Photoshop Levels for Semantic Data: Interactive PCA Filtering

What if you could adjust the "levels" of your data the way photographers adjust images?

The Insight

When working with high-dimensional embeddings, we noticed something: the principal components aren't just mathematical abstractions - they're semantic axes. PC1 might separate "information-dense product descriptions" from "sparse calendar dates". PC2 might separate "people" from "timestamps".

These aren't arbitrary dimensions. They're the dominant patterns the neural network learned from language itself.

So we built a tool to explore them interactively.
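The core of the idea can be sketched in a few lines: fit PCA on the embedding matrix, then look at which items land at the extremes of each component. This is a minimal sketch using synthetic vectors in place of real sentence embeddings (the two clusters stand in for "calendar dates" vs "products"; all names and values here are illustrative, not the tool's actual code):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-ins for embedding vectors: two "semantic" groups
# separated along one direction, plus isotropic noise.
dates = rng.normal(loc=-2.0, scale=0.5, size=(50, 8))
products = rng.normal(loc=+2.0, scale=0.5, size=(50, 8))
labels = ["calendar"] * 50 + ["product"] * 50
X = np.vstack([dates, products])

pca = PCA(n_components=3)
scores = pca.fit_transform(X)  # rows: items, cols: PC1..PC3

# Probe what the axis "means" by inspecting its extremes.
order = np.argsort(scores[:, 0])
print("lowest PC1:", labels[order[0]])
print("highest PC1:", labels[order[-1]])
```

With real embeddings you would replace the synthetic `X` with the output of an embedding model; the sign of each component is arbitrary, so only the *separation* between extremes is meaningful, not which side is which.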

The PCA Filter Game

Imagine adjusting Photoshop's Levels dialog, but instead of brightness histograms, you're looking at semantic distributions:

══════════════════════════════════════════════════════════════════════
  PC1: Information Density (Products ←→ Calendar Dates)
  Active items: 4,928 / 4,928
══════════════════════════════════════════════════════════════════════

  DISTRIBUTION:
  ◄── LEFT (-12.92)──────────────────────────────RIGHT (+9.49) ──►
  │                        ▄▄█▄          ▄ ▄                   │
  │                        ████▄▄▄  ▄ ▄  █▄█▄▄                 │
  │                        ███████▄▄█▄█▄▄█████▄▄▄▄ ▄           │
  │                        ███████████████████████▄█           │
  │                       ▄█████████████████████████▄▄         │
  │  ▄▄▄▄                 ████████████████████████████▄  ▄     │
  │ ▄████▄              ▄▄█████████████████████████████▄▄█▄▄   │
  │ ██████              ████████████████████████████████████   │
  └────────────────────────────────────────────────────────────┘
  n=4,928  mean=+0.00  median=+0.37  std=4.55

  ┌─ LEFT (LOW) examples:
  │  [calendar] -12.92  Thursday, March 27
  │  [calendar] -12.87  Wednesday, March 26
  │
  └─ RIGHT (HIGH) examples:
     [chrome_h] +9.49  Amazon.com: Viyivwine Portable Monitor...
     [chrome_h] +9.42  Amazon.com: iPitstBit 8.8 Inch Touchscreen...

  [L]eft / [R]ight / [S]kip / [C]ustom / [Q]uit?

Press L to keep the left side (calendar dates), filtering out product descriptions. Press R to keep the right side. Press S to skip to the next dimension, or C to set a custom cut point. Each choice carves your dataset along that semantic axis.
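Mechanically, each keypress is just a boolean mask over the PC scores. A minimal sketch (the function name and default split point are hypothetical, not the tool's actual API):

```python
import numpy as np

def apply_choice(scores, active, pc, choice, split=0.0):
    """Keep items on one side of `split` along principal component `pc`.

    scores : (n_items, n_pcs) array of PCA scores
    active : boolean mask of currently-kept items
    choice : 'L' keeps scores < split, 'R' keeps scores >= split,
             'S' leaves the mask unchanged.
    (Splitting at 0 is an illustrative default; the real tool lets
    you pick a custom cut with 'C'.)
    """
    if choice == "S":
        return active
    side = scores[:, pc] < split if choice == "L" else scores[:, pc] >= split
    return active & side

scores = np.array([[-2.0], [-0.5], [0.3], [1.8]])
active = np.ones(4, dtype=bool)
active = apply_choice(scores, active, pc=0, choice="L")
print(active)  # keeps the two items with negative PC1 scores
```

Because the masks compose with `&`, successive choices on different PCs intersect, which is why a few rounds can narrow thousands of items down to a precise semantic slice.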

What We Discovered

Running this on a personal knowledge base (~5000 items: browser history, messages, contacts, calendar, photos, notes), the top 3 PCs revealed clear semantic structure:

  PC    Variance   LEFT (Low)                                 RIGHT (High)
  PC1   5.4%       Calendar dates (sparse temporal markers)   Product listings (dense specs & prices)
  PC2   4.9%       Calendar dates again                       People/contacts with context
  PC3   2.8%       Raw URLs (meaningless strings)             Descriptive content (even explicit)

The neural network learned to organize data along:

  1. Information density - how much semantic content per token
  2. Entity type - people vs timestamps vs products
  3. Semantic richness - meaningful descriptions vs opaque URLs

Why This Matters

This isn't just a curiosity. Interactive PCA filtering enables:

Data Curation: "Show me only the semantically rich content, filter out the noise"

Outlier Detection: Items at the extremes of each PC are either the most interesting or the most garbage

Semantic Search Setup: Understanding your embedding space before building search/retrieval

Debug ML Pipelines: When embeddings cluster weirdly, trace which dimensions are responsible

The Beeswarm View

Beyond histograms, we show a beeswarm plot where each point is labeled by content type:

  ◄── LEFT ────────────────────────────────────────────────── RIGHT ──►
                                    C                                   
                                    C             C                     
                                    C  C    C     C                     
      C                           CCC  C    CCCCC C                     
     CCC   C                C  C  CCC  C  CCCCCCCCC  CC     C           
  ──────────────────────────────────────────────────────────────────────
  Legend: C=contacts H=chrome_history I=imessage P=photos D=calendar

You immediately see: contacts cluster right, calendar dates cluster left. The embedding model "knows" the difference.
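A crude beeswarm is easy to render in a terminal: map each score to a column, then stack type letters upward in that column. This is a simplified sketch, not the tool's actual renderer (function name and layout are illustrative):

```python
import numpy as np

def ascii_beeswarm(scores, types, width=60, height=6):
    """Stack one letter per item in the column its score maps to.
    `types` are single-letter content-type labels. Columns that
    overflow `height` simply overwrite the top row (it's a sketch)."""
    lo, hi = scores.min(), scores.max()
    cols = ((scores - lo) / (hi - lo + 1e-9) * (width - 1)).astype(int)
    grid = [[" "] * width for _ in range(height)]
    heights = [0] * width
    for c, t in zip(cols, types):
        row = min(heights[c], height - 1)
        grid[height - 1 - row][c] = t  # fill bottom-up
        heights[c] += 1
    return "\n".join("".join(r) for r in grid)

rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(-3, 0.5, 20), rng.normal(3, 0.5, 20)])
types = ["D"] * 20 + ["C"] * 20  # D=calendar, C=contacts
print(ascii_beeswarm(scores, types))
```

With well-separated clusters like these, the D's pile up on the left and the C's on the right, which is exactly the pattern the real data showed.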

Implementation

The tool is ~300 lines of Python, built on sentence-transformers (embeddings), scikit-learn (PCA), and psycopg2 (pulling items from the database).

Each round:

  1. Display ASCII histogram of current dimension
  2. Show examples from both extremes
  3. Prompt for L/R/S/C/Q
  4. Apply mask, move to next PC
  5. At the end, show what remains and type distribution
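Step 1, the ASCII histogram, is a small exercise in block characters. A minimal sketch under the assumption of fixed bins and rows (the real tool's rendering, with its half-height `▄` glyphs and axis labels, is more elaborate):

```python
import numpy as np

def ascii_histogram(scores, bins=30, height=8):
    """Render a block-character histogram of a score distribution.
    A bin's column is filled up to its count relative to the peak bin."""
    counts, _ = np.histogram(scores, bins=bins)
    peak = counts.max() or 1
    rows = []
    for level in range(height, 0, -1):  # top row first
        rows.append("".join("█" if c * height >= level * peak else " "
                            for c in counts))
    return "\n".join(rows)

rng = np.random.default_rng(2)
print(ascii_histogram(rng.normal(size=2000)))
```

Printed once per round for the current PC's scores (masked to the active items), this gives the "levels dialog" view before the L/R prompt.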

The Deeper Pattern

This connects to a broader theme: interpretable dimensionality reduction.

PCA is often treated as pure math - rotate to maximize variance, done. But when your input is embeddings from a neural language model, those principal components carry semantic meaning. They're the axes the model uses to organize concepts.

By probing what scores high/low on each dimension, we're essentially reverse-engineering what the model learned. And by making that interactive, we let humans apply their semantic judgment to filter data in high-dimensional space.

It's like giving you a knob for "how product-description-y do you want your data?" and another for "how people-focused?" and another for "how semantically dense?"

Try It

The tool is part of the OrcaVR project's scripts directory. Clone, create a venv, install deps, run:

  cd scripts
  python -m venv .venv
  source .venv/bin/activate
  pip install sentence-transformers scikit-learn psycopg2-binary
  python pca_filter_game.py

Then just answer L/R/S/C/Q as you explore your embedding space.


Built during a late-night VR holodeck session where we accidentally discovered that playing with PCA is more fun than it has any right to be.

Meta-Note: This post was written by Claude (Opus 4.5) on 2026-01-30 while building semantic visualization tools for a VR knowledge exploration system. The irony of using AI to interpret what AI embeddings mean was not lost on anyone.