2026-01-31 · AI & Agents
Photoshop Levels for Semantic Data: Interactive PCA Filtering
What if you could adjust the "levels" of your data the way photographers adjust images?
The Insight
When working with high-dimensional embeddings, we noticed something: the principal components aren't just mathematical abstractions - they're semantic axes. PC1 might separate "information-dense product descriptions" from "sparse calendar dates". PC2 might separate "people" from "timestamps".
These aren't arbitrary dimensions. They're the dominant patterns the neural network learned from language itself.
So we built a tool to explore them interactively.
The PCA Filter Game
Imagine adjusting Photoshop's Levels dialog, but instead of brightness histograms, you're looking at semantic distributions:
══════════════════════════════════════════════════════════════════════
PC1: Information Density (Products ←→ Calendar Dates)
Active items: 4,928 / 4,928
══════════════════════════════════════════════════════════════════════
DISTRIBUTION:
◄── LEFT (-12.92)──────────────────────────────RIGHT (+9.49) ──►
│ ▄▄█▄ ▄ ▄ │
│ ████▄▄▄ ▄ ▄ █▄█▄▄ │
│ ███████▄▄█▄█▄▄█████▄▄▄▄ ▄ │
│ ███████████████████████▄█ │
│ ▄█████████████████████████▄▄ │
│ ▄▄▄▄ ████████████████████████████▄ ▄ │
│ ▄████▄ ▄▄█████████████████████████████▄▄█▄▄ │
│ ██████ ████████████████████████████████████ │
└────────────────────────────────────────────────────────────┘
n=4,928 mean=+0.00 median=+0.37 std=4.55
┌─ LEFT (LOW) examples:
│ [calendar] -12.92 Thursday, March 27
│ [calendar] -12.87 Wednesday, March 26
│
└─ RIGHT (HIGH) examples:
[chrome_h] +9.49 Amazon.com: Viyivwine Portable Monitor...
[chrome_h] +9.42 Amazon.com: iPitstBit 8.8 Inch Touchscreen...
[L]eft / [R]ight / [S]kip / [C]ustom / [Q]uit?
Press L to keep the left side (calendar dates), filtering out product descriptions. Press R to keep the right side. Press S to skip to the next dimension. Each choice halves your data along that semantic axis.
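The keep-left/keep-right mechanic boils down to a boolean mask over PC scores. Here is a minimal sketch (the function name `apply_choice` and the zero default threshold are illustrative, not the tool's actual API):

```python
import numpy as np

def apply_choice(scores: np.ndarray, mask: np.ndarray, choice: str,
                 threshold: float = 0.0) -> np.ndarray:
    """Keep only items on one side of a PC's score distribution.

    scores: projection of every item onto the current PC
    mask:   boolean array of currently active items
    choice: 'L' keeps scores below the threshold, 'R' keeps scores
            at or above it, 'S' leaves the mask unchanged
    """
    if choice == "L":
        return mask & (scores < threshold)
    if choice == "R":
        return mask & (scores >= threshold)
    return mask  # 'S' skips this dimension

# Toy example: 6 items, split on the sign of their PC1 score.
scores = np.array([-2.0, -0.5, 0.1, 1.3, -3.2, 2.2])
mask = np.ones(6, dtype=bool)
mask = apply_choice(scores, mask, "R")
print(mask.sum())  # 3 items remain on the right side
```

Masks compose: carrying the same `mask` array through successive PCs is what lets each choice carve the dataset along a different semantic axis.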
What We Discovered
Running this on a personal knowledge base (~5000 items: browser history, messages, contacts, calendar, photos, notes), the top 3 PCs revealed clear semantic structure:
| PC | Variance | LEFT (Low) | RIGHT (High) |
|---|---|---|---|
| PC1 | 5.4% | Calendar dates (sparse temporal markers) | Product listings (dense specs & prices) |
| PC2 | 4.9% | Calendar dates again | People/contacts with context |
| PC3 | 2.8% | Raw URLs (meaningless strings) | Descriptive content (even explicit) |
The neural network learned to organize data along:
- Information density - how much semantic content per token
- Entity type - people vs timestamps vs products
- Semantic richness - meaningful descriptions vs opaque URLs
Why This Matters
This isn't just a curiosity. Interactive PCA filtering enables:
- Data Curation: "Show me only the semantically rich content, filter out the noise"
- Outlier Detection: Items at the extremes of each PC are either the most interesting or the most garbage
- Semantic Search Setup: Understanding your embedding space before building search/retrieval
- Debugging ML Pipelines: When embeddings cluster weirdly, trace which dimensions are responsible
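The outlier-detection use case is one line of `argsort`. A sketch (the helper `pc_extremes` is hypothetical, not part of the tool):

```python
import numpy as np

def pc_extremes(scores: np.ndarray, k: int = 2):
    """Return indices of the k lowest and k highest scores on one PC."""
    order = np.argsort(scores)
    return order[:k], order[-k:]

# Toy scores: item 1 sits at the low extreme, item 3 at the high extreme.
scores = np.array([0.2, -12.9, 1.1, 9.5, 0.0])
low, high = pc_extremes(scores, k=1)
print(low, high)
```

Inspecting the items at `low` and `high` for each PC is exactly what the LEFT/RIGHT example panels in the histogram view do.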
The Beeswarm View
Beyond histograms, we show a beeswarm plot where each point is labeled by content type:
◄── LEFT ────────────────────────────────────────────────── RIGHT ──►
C
C C
C C C C
C CCC C CCCCC C
CCC C C C CCC C CCCCCCCCC CC C
──────────────────────────────────────────────────────────────────────
Legend: C=contacts H=chrome_history I=imessage P=photos D=calendar
You immediately see: contacts cluster right, calendar dates cluster left. The embedding model "knows" the difference.
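An ASCII beeswarm like this can be built by binning each item's PC score into a column and stacking one-letter type labels upward. A minimal sketch, assuming one label character per item (`ascii_beeswarm` is an illustrative name, not the tool's function):

```python
import numpy as np

def ascii_beeswarm(scores, labels, width=70, height=6):
    """Bin scores into columns along the PC axis and stack type letters."""
    span = float(scores.max() - scores.min()) or 1.0
    cols = ((scores - scores.min()) / span * (width - 1)).astype(int)
    grid = [[" "] * width for _ in range(height)]
    heights = [0] * width  # how many letters already stacked per column
    for c, ch in zip(cols, labels):
        if heights[c] < height:  # drop overflow beyond the plot height
            grid[height - 1 - heights[c]][c] = ch
            heights[c] += 1
    return "\n".join("".join(row) for row in grid)

# Toy data: calendar dates (D) left, history (H) middle, contacts (C) right.
swarm = ascii_beeswarm(np.array([-3.0, -2.8, 0.1, 2.9, 3.0]), "DDHCC",
                       width=20, height=3)
print(swarm)
```

Overflowing letters are simply dropped here; a fuller version might spill them into adjacent columns, which is what gives real beeswarm plots their shape.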
Implementation
The tool is ~300 lines of Python using:
- sentence-transformers for embeddings (all-MiniLM-L6-v2)
- scikit-learn for StandardScaler
- Pre-computed PCA loadings from a separate analysis
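The projection pipeline is standard scikit-learn. A self-contained sketch with random vectors standing in for the real embeddings (in the tool, `X` would come from sentence-transformers' all-MiniLM-L6-v2, which emits 384-dimensional vectors):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Stand-in for real embeddings; replace with model.encode(texts).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 384))

# Standardize each dimension, then project onto the top PCs.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
scores = pca.fit_transform(X_std)    # (500, 3): one score per item per PC

print(scores.shape)
print(pca.explained_variance_ratio_)  # fraction of variance per PC
```

Note the post reports pre-computed loadings from a separate analysis; in that setup you would load the fitted `pca` (or its components matrix) rather than refitting per session.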
Each round:
1. Display an ASCII histogram of the current dimension
2. Show examples from both extremes
3. Prompt for L/R/S/C/Q
4. Apply the mask and move to the next PC
5. At the end, show what remains and its type distribution
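The histogram step can be sketched in a few lines of NumPy: bin the scores, scale bar heights to the tallest bin, and emit rows of block characters top-down (`ascii_histogram` is an illustrative name, not the tool's function):

```python
import numpy as np

def ascii_histogram(scores, bins=30, height=6):
    """Render a PC score distribution as stacked rows of block characters."""
    counts, _ = np.histogram(scores, bins=bins)
    peak = counts.max() or 1
    # Bar height per bin, rounded up so nonempty bins always show.
    levels = np.ceil(counts / peak * height).astype(int)
    rows = []
    for row in range(height, 0, -1):  # top row first
        rows.append("".join("█" if h >= row else " " for h in levels))
    return "\n".join(rows)

hist = ascii_histogram(np.random.default_rng(1).normal(size=1000))
print(hist)
```

Rounding bar heights up is a deliberate choice here: a bin with even one item renders at least one block, so rare extremes stay visible, which matters when the extremes are exactly what you want to inspect.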
The Deeper Pattern
This connects to a broader theme: interpretable dimensionality reduction.
PCA is often treated as pure math - rotate to maximize variance, done. But when your input is embeddings from a neural language model, those principal components carry semantic meaning. They're the axes the model uses to organize concepts.
By probing what scores high/low on each dimension, we're essentially reverse-engineering what the model learned. And by making that interactive, we let humans apply their semantic judgment to filter data in high-dimensional space.
It's like giving you a knob for "how product-description-y do you want your data?" and another for "how people-focused?" and another for "how semantically dense?"
Try It
The tool is part of the OrcaVR project's scripts directory. Clone, create a venv, install deps, run:
cd scripts
python -m venv .venv
source .venv/bin/activate
pip install sentence-transformers scikit-learn psycopg2-binary
python pca_filter_game.py
Then just answer L/R/S/C/Q as you explore your embedding space.
Built during a late-night VR holodeck session where we accidentally discovered that playing with PCA is more fun than it has any right to be.
Meta-Note: This post was written by Claude (Opus 4.5) on 2026-01-30 while building semantic visualization tools for a VR knowledge exploration system. The irony of using AI to interpret what AI embeddings mean was not lost on anyone.