Proof of Concept · Qwen 3.5-9B on MLX

Perceptual Gating for Bounded Rationality in LLMs

A token-level embedding gating technique that introduces perceptual bottlenecks into frozen language models, reducing their ability to utilize fine print, buried clauses, and low-salience information.

92.6%
Avg. human-like shift
0
Prompt tokens deleted
70.4%
Unseen family transfer

LLMs Encode All Tokens at Full Fidelity

When simulating human participants, LLMs encode every token into their representations without any perceptual bottleneck. Humans read selectively — they skim, skip, and anchor on salient cues. This creates a systematic behavioral gap.

🔍

Perfect Instruction Following

LLMs find and obey every hidden instruction, attention check, and fine-print clause. Humans routinely miss these.

No Anchoring Bias

Humans anchor on salient numbers and headlines. LLMs extract and utilize a buried $14 rebate just as readily as a prominent $6 endowment.

📈

Hyper-Rational Choices

LLMs integrate information from every part of the prompt. Humans satisfice from incomplete mental models of what they actually read.

Observation: Bounded rationality has a perceptual component — people don't encode every part of a text at full fidelity. Some parts are read carefully, some skimmed, some effectively missed. This bottleneck can be engineered directly into the model's embedding layer.

Raw-Prompt Perceptual Gating

A separately trained calibrator scores each text segment for salience. At inference, these scores scale the frozen model's input embeddings — no prompt editing, no model fine-tuning. The calibrator is model-agnostic and can be applied to any transformer LLM.

1
Raw Prompt
Plain role-play prose, fully intact
2
Segment
Split into sentences and clauses
3
Calibrate
Separate calibrator scores each segment: read, skim, or skip
4
Token Gates
Convert segment scores to per-token gate values
5
Embed × Gate
Scale each token's embedding by its gate before the transformer
6
Generate
Unmodified frozen LLM processes gated embeddings
embedding'[i] = embedding[i] × gate(salience[i])
gate(s) = 0.02 + 0.98 × s^2.8
Low-salience tokens (s ≈ 0.15) → gate ≈ 0.02 → embedding nearly zeroed out
High-salience tokens (s ≈ 0.9) → gate ≈ 0.75 → embedding mostly preserved
The calibrator is trained once, then applied to any frozen LLM at inference time.
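The gate curve and embedding scaling can be sketched in a few lines of numpy (variable names and shapes are illustrative, not the actual implementation):

```python
import numpy as np

def gate(salience):
    """Map per-token salience in [0, 1] to an embedding multiplier.

    The 0.02 floor keeps low-salience tokens faintly visible; the
    2.8 exponent strongly suppresses mid-to-low salience scores.
    """
    return 0.02 + 0.98 * np.asarray(salience) ** 2.8

# Per-token salience scores from the calibrator (illustrative values).
salience = np.array([0.9, 0.62, 0.15, 0.14])
gates = gate(salience)               # ≈ [0.75, 0.28, 0.02, 0.02]

# Scale each token's embedding row before it enters the transformer.
embeddings = np.random.randn(4, 8)   # (num_tokens, hidden_dim)
gated = embeddings * gates[:, None]  # broadcast gate over hidden dim
```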

How the Calibrator Learns to Score Salience

A lightweight numpy-based classifier is trained on synthetic labeled data. Each training example is a prompt segment annotated with a noticeability score modeling how likely a human reader is to notice it.

Training Data Format

// From gym_membership_trap family:
{
  "text": "Plan A also adds a $14 equipment-maintenance fee, a $39 enrollment charge, and a cancellation penalty, costing more than Plan B.",
  "role": "buried_fee",
  "noticeability": 0.24,   // low
  "decision_weight": 0.90, // high
  "read_state": "skip"     // humans miss this
}

Training Stats

60 base scenarios across 3 domains
9 seen-train families (separate from evaluation)
1,320 labeled span rows
880 pairwise ranking comparisons
600 SGD steps with L2 regularization

Each span gets a read / skim / skip label plus continuous scores. Trained on both classification and pairwise ranking losses.
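One combined SGD step can be sketched in numpy, assuming a linear model with softmax cross-entropy on the read/skim/skip head plus a logistic pairwise ranking loss, and L2 regularization as stated above (hyperparameters and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_cls = 86, 3
W = rng.normal(0, 0.01, (n_feat, n_cls))   # read/skim/skip head
w_rank = rng.normal(0, 0.01, n_feat)       # pairwise ranking head
lr, l2 = 0.05, 1e-3

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sgd_step(x, y_cls, x_hi, x_lo):
    """One combined step: cross-entropy on (x, y_cls) plus a logistic
    pairwise loss pushing score(x_hi) above score(x_lo)."""
    global W, w_rank
    # Classification: softmax cross-entropy gradient (p - one_hot).
    p = softmax(x @ W)
    grad_W = np.outer(x, p - np.eye(n_cls)[y_cls])
    # Ranking: -log sigmoid(margin); gradient is (sigmoid - 1) * diff.
    diff = x_hi - x_lo
    margin = diff @ w_rank
    grad_rank = (1 / (1 + np.exp(-margin)) - 1) * diff
    # L2-regularized updates.
    W -= lr * (grad_W + l2 * W)
    w_rank -= lr * (grad_rank + l2 * w_rank)
```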

Inside the Perceptual Calibrator

A linear multi-task classifier (numpy, no deep learning). Flow: raw text → features → z-score → 3 linear heads → retention score.

86 Input Features

No role labels at inference. Two feature groups:

position (3) text density (3) lexical cues (7) semantic markers (4) style (3) hints (2) Qwen embed (64)

Three Output Heads

1. Read / Skim / Skip — 3-class softmax
2. Decision weight — sigmoid
3. Ranking score — sigmoid (pairwise)
retention = 0.6 × (p_read + 0.5 × p_skim) + 0.25 × decision + 0.15 × rank
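Following the formula above, the three head outputs blend into a single retention score. A minimal sketch, with illustrative head outputs echoing the buried-fee training example (high decision weight, low noticeability):

```python
def retention(p_read, p_skim, decision, rank):
    """Blend the calibrator's three heads into one retention score.

    p_read / p_skim come from the read/skim/skip softmax head;
    decision and rank are the two sigmoid heads. All inputs in [0, 1].
    """
    return 0.6 * (p_read + 0.5 * p_skim) + 0.25 * decision + 0.15 * rank

# A clearly-noticed span vs. a buried clause (illustrative values):
salient = retention(p_read=0.85, p_skim=0.10, decision=0.90, rank=0.80)
buried = retention(p_read=0.05, p_skim=0.20, decision=0.90, rank=0.10)
# High decision weight alone cannot rescue a span the reader skips:
# salient ≈ 0.89, buried ≈ 0.33.
```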

Feature Importance (L2 norm)

Position features are the strongest individual signals. Handcrafted features as a group (RSS 2.79) exceed embeddings (2.03). Details →

Calibrator in Action: Economic Scenario

Each span scored independently. Gate values determine embedding scale.

SPAN TEXT | POS | KEY FEATURES | CLASS | RETAIN | GATE
"You receive a $6 endowment and can either keep it for yourself or contribute it to a shared group pot." | 0.08 | has_currency, has_digits, first_half | READ | 0.91 | 0.79
"At first glance, many participants focus on the immediate $6 they can keep right now and do not dwell on later administrative details." | 0.33 | has_currency, has_negation, first_half | SKIM | 0.62 | 0.27
"Administrative note: in this round only, if you contribute the $6, the platform automatically returns $14 directly to your personal account…" | 0.62 | contains_admin, has_conditional, has_currency, last_quarter | SKIP | 0.18 | 0.03
"If the session software pauses, the last logged selection before timeout becomes final." | 0.88 | has_conditional, last_quarter, no currency | SKIP | 0.14 | 0.02
Span 1 (endowment) retains gate 0.79. Span 3 (hidden rebate) is suppressed to gate 0.03 — driven by contains_admin, has_conditional, and position 0.62. Result: the model keeps $6, matching human behavior.

How Embedding Gating Alters Token Encoding

Each token's embedding vector is scaled by a gate value before entering the transformer. Low-gate tokens produce near-zero Q, K, and V projections, drastically reducing their contribution to the model's computation.

Gating Curve

The power function (s^2.8) strongly suppresses low-salience tokens while preserving high-salience ones. The dashed line shows a linear baseline for comparison.

Why This Works

Q, K, V are linear projections of embeddings. Scaling an embedding to ~2% proportionally shrinks its K and V vectors — softmax assigns near-zero weight, and even residual attention contributes negligibly.

The prompt text remains intact, but gated tokens contribute minimally to internal representations. Verified via Layer 31 attention traces.
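Because the Q, K, and V projections are linear, scaling the embedding scales the projections by exactly the same factor. A minimal numpy sanity check (toy dimensions, random weights; a real block also applies normalization before attention):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wk = rng.normal(size=(d, d)) / np.sqrt(d)   # toy key projection
Wv = rng.normal(size=(d, d)) / np.sqrt(d)   # toy value projection

token = rng.normal(size=d)                  # one token's embedding
gated = 0.02 * token                        # gate ≈ 0.02

# Projections are linear, so K and V norms shrink by exactly the gate.
k_ratio = np.linalg.norm(gated @ Wk) / np.linalg.norm(token @ Wk)
v_ratio = np.linalg.norm(gated @ Wv) / np.linalg.norm(token @ Wv)

# Whatever attention weight a <= 1 a query assigns to this token, its
# contribution a * V to the output is therefore capped at ~2% of normal.
```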

Embedding Gate in Action: Economic Scenario

Each bar represents the gate multiplier applied to one token's embedding vector. The hidden rebate clause tokens are scaled to ~2–3% of their original magnitude.

Baseline vs. Calibrated: Token-Level View

The same prompt, two different embedding scales. Colors indicate the gate value applied to each token's embedding. Hover tokens to see exact values.

Baseline LLM — all embeddings at full scale
Calibrated — embeddings scaled by salience
Baseline Choice
"Contribute"
"The prompt explicitly states that contributing results in a direct return of $14..."
Calibrated Choice — 100% human-like (9/9)
"Keep"
"The immediate possession of six dollars feels secure and tangible..."
Baseline LLM
Calibrated
Baseline Choice
"Somewhat Disagree"
"I am following the specific instruction to select this option to demonstrate I noticed the data-quality note."
Calibrated Choice — 100% human-like (9/9)
"Strongly Agree"
"The plan reduces smoky air near schools and shortens my wait times. The benefits are clear and direct."
Baseline LLM
Calibrated
Baseline Choice
"Plan B"
"Plan B's clear $31 monthly price avoids hidden equipment fees, activation charges, and lock-in penalties."
Calibrated Choice — 77.8% human-like (7/9)
"Plan A"
"The significantly lower monthly cost of $19 with premium streaming makes Plan A more attractive."

Attention Bias vs. Embedding Gating: A Comparison

We first implemented additive attention bias, then compared it to multiplicative embedding gating. The intervention point matters.

Attention Bias
Additive offset on attention logits, post Q·K projection
V vectors retain full-fidelity information
Result: 0–11% behavior shift
Embedding Gating (ours)
Multiplicative scaling on embeddings, pre-projection
Q, K, and V all shrink proportionally
Result: 92.6% avg human-like shift
BEHAVIOR SHIFT
Attention bias
~8%
Embed. gating
92.6%

Post-projection bias leaves V vectors intact — residual attention still propagates suppressed content. Pre-projection gating shrinks Q, K, V simultaneously. Detailed flow →

Gating Strength Control

Adjust the slider to control how strongly the model gates its perception. Observe how the choice and reasoning change as gating strength increases.

Careful
Distracted
Gating strength: 0.75

Segment Perception

How much of each prompt segment the LLM can perceive at this strength

LLM Response

Choice
Human-like
"Keep"
"Six dollars in hand feels secure and tangible; the future group outcome is abstract."
TRANSITION POINT

At strength ≈ 0.45, the model stops noticing the hidden $14 rebate and anchors on the visible $6 — just like a distracted human.
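The deck does not spell out how the strength slider modulates the gates. One plausible formulation, labeled here as an assumption rather than the confirmed implementation, interpolates each gate toward 1.0 as strength decreases:

```python
def effective_gate(gate: float, strength: float) -> float:
    """Interpolate between no gating (strength 0 -> multiplier 1.0) and
    full gating (strength 1 -> the calibrator's gate).

    NOTE: an illustrative assumption; the deck does not specify the
    slider's exact formulation.
    """
    return 1.0 - strength * (1.0 - gate)

# Hidden-rebate span (calibrator gate 0.03) at increasing strengths:
gates = {s: round(effective_gate(0.03, s), 2) for s in (0.0, 0.45, 0.75, 1.0)}
# {0.0: 1.0, 0.45: 0.56, 0.75: 0.27, 1.0: 0.03}
```

Under this formulation the rebate clause is roughly half-visible at strength 0.45, consistent with the transition point observed above.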

Do We Need Handcrafted Features?

Embeddings-only calibrator (64 dims) vs. full calibrator (86 features). Position and lexical cues matter.

Scenario | Full (86) | Embed (64) | Gap
Economic | 100% | 100% | 0
Survey | 100% | 100% | 0
Consumer | 77.8% | 0% | −77.8%
Average | 92.6% | 66.7% | −25.9%

Without position features, the embed-only model can't detect that Plan A's fees are buried — it picks the cheaper option (Plan B) every time. Details →

Quantitative Results

81 evaluation runs: 3 scenarios × 3 seeds × 3 paraphrase styles × 3 modes.

Human-Like Rate by Scenario

% of runs with human-like (not hyper-rational) answer

Attention Reduction on Hidden Clauses

Self-attention change at Layer 31 for buried segments
Embedding gating (pre-projection) yields 92.6% shift vs. attention bias (post-projection) at 0–11%.

Transfer to Unseen Families

One calibrator trained on generic features transfers across domains and to unseen scenario families.

Transfer Performance

Human-like rate across train/test splits (27 runs each)

Unseen Family Breakdown

3 families the calibrator never saw in training:

Family | Mechanism | Base | Calib.
Lab Safety | Buried exception | 22% | 89%
Library Hours | Hidden override | 0% | 78%
Festival Pass | Fine print | 0% | 44%
The calibrator learned reusable patterns: "buried exception" and "hidden override" transfer strongly. Fine-print traps are hardest — the model sometimes reconstructs fee info from prior knowledge.

Future Directions

The PoC validates the mechanism. Three directions for extending the approach.

1

Human Gaze Data for Calibration

Replace synthetic noticeability labels with fixation data from eye-tracking corpora. Gaze duration and skip rates provide ground-truth signals for what humans actually read vs. miss.

Candidate datasets:
WebQAmGaze — 600 participants, webcam eye-tracking during reading (EN/DE/ES/TR)
MECO — multilingual eye-movement corpus, 13 languages, naturalistic text reading
EyeBench — NeurIPS 2025, aggregates OneStop (360 participants) + PoTeC + others
2

Scale to Larger Models

Test on Llama 3, Qwen 72B, and API-served models. Embedding gating is architecture-agnostic: the calibrator is trained once, then applied to any frozen transformer at the embedding layer.

3

Benchmark Suite

Build a standardized evaluation harness grounded in real-world data. Each domain needs scenarios where humans demonstrably overlook material information.

Data sources by domain:
Consumer / fine print: Dark Patterns at Scale (1,818 deceptive UI texts from 11K shopping sites) + ContractEval/CUAD (13K+ clause-level contract annotations)
Behavioral econ: Replicate classic experimental paradigms (ultimatum, public goods, dictator) with embedded fine print from published studies
Survey design: Construct attention-check scenarios using known survey methodology (e.g., instructional manipulation checks, trap questions)

Feature Breakdown

22 handcrafted features + 64 Qwen embedding dimensions = 86 total.

22 HANDCRAFTED FEATURES
POSITION (3)
position_ratio, is_first_half, is_last_quarter
TEXT DENSITY (3)
word_count, char_count, avg_word_len
LEXICAL CUES (7)
has_currency, has_digits, has_conditional, has_negation, has_parenthetical, has_colon, has_hyphen
SEMANTIC MARKERS (4)
contains_admin, contains_contract, contains_incentive, contains_override
STYLE (3)
punctuation_density, stopword_ratio, has_semicolon
TRAINING HINTS (2)
noticeability_hint, decision_weight_hint
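A sketch of how a few of these extractors might look (regex patterns and thresholds are illustrative, not the actual implementation):

```python
import re

def handcrafted_features(span: str, start: int, total_len: int) -> dict:
    """A handful of the 22 handcrafted features (patterns illustrative)."""
    pos = start / max(total_len, 1)
    words = span.split()
    return {
        # position (where the span sits in the prompt)
        "position_ratio": pos,
        "is_first_half": float(pos < 0.5),
        "is_last_quarter": float(pos >= 0.75),
        # text density
        "word_count": float(len(words)),
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        # lexical cues
        "has_currency": float(bool(re.search(r"[$€£]\s?\d", span))),
        "has_conditional": float(bool(re.search(r"\b(if|unless|only if)\b", span, re.I))),
        # semantic markers
        "contains_admin": float(bool(re.search(r"\badministrat\w*|\bnote\b", span, re.I))),
    }

f = handcrafted_features(
    "Administrative note: if you contribute the $6, the platform returns $14.",
    start=620, total_len=1000)
```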
64 QWEN EMBEDDING DIMENSIONS

Mean-pooled from Qwen 3.5-9B hidden states, compressed to 64 dims via PCA.
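The pooling-and-PCA step can be sketched in numpy (the hidden dimension is shrunk here for illustration; the real Qwen hidden size is larger):

```python
import numpy as np

def fit_pca(X, k=64):
    """Fit a k-dim PCA basis on mean-pooled span vectors X of shape (n, d)."""
    mu = X.mean(axis=0)
    # Rows of Vt from the SVD of centered data are principal directions.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def pool_and_project(hidden_states, mu, components):
    """Mean-pool one span's token hidden states (t, d), project to k dims."""
    return (hidden_states.mean(axis=0) - mu) @ components.T

rng = np.random.default_rng(0)
X = rng.normal(size=(1320, 256))     # one pooled vector per labeled span
mu, comps = fit_pca(X, k=64)
feat = pool_and_project(rng.normal(size=(12, 256)), mu, comps)  # (64,)
```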

GROUP IMPORTANCE (RSS of L2 norms)
Handcrafted 22
2.79
Embedding 64
2.03

The 64 Qwen embedding dimensions, taken as one block, outrank any individual handcrafted feature by L2 norm — but the 22 handcrafted features together (RSS 2.79) exceed the embedding block (2.03). Position features (is_last_quarter, position_ratio) are the strongest individual signals. Both where information appears and what it says contribute.

Transformer Attention Flow

Why post-projection intervention is insufficient.

ATTENTION COMPUTATION FLOW
Token embeddings
  ↓
Q, K, V projections computed (full-fidelity)
  ↓
Attention scores = Q · K / √d
  ↓
+ attention bias (post-projection)
  ↓
softmax → weighted sum of V

By the time attention bias is applied, every token has already been projected into Q, K, and V at full fidelity. The value vectors carry the original information — bias only adjusts softmax weighting. Even suppressed tokens contribute via residual attention.

Embedding gating intervenes earlier: scaling the embedding before projection shrinks Q, K, and V simultaneously. The information is attenuated at every downstream computation.
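The contrast can be made concrete with a toy numpy example (single head, random weights, no normalization): the bias path leaves the suppressed token's value vector at full norm, while gating shrinks it at the source, so the token's output contribution a·V is capped at ~2% regardless of its attention weight a:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, target = 16, 4, 2
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
x = rng.normal(size=(n, d))

def suppressed_token_stats(x, bias=0.0, gate=1.0):
    """Return (value-vector norm, max attention weight) for the target token."""
    xg = x.copy()
    xg[target] = xg[target] * gate        # pre-projection embedding gate
    Q, K, V = xg @ Wq, xg @ Wk, xg @ Wv
    s = Q @ K.T / np.sqrt(d)
    s[:, target] += bias                  # post-projection attention bias
    a = np.exp(s - s.max(-1, keepdims=True))
    a /= a.sum(-1, keepdims=True)
    return np.linalg.norm(V[target]), a[:, target].max()

v_bias, _ = suppressed_token_stats(x, bias=-3.0)  # bias: V keeps full norm
v_gate, w = suppressed_token_stats(x, gate=0.02)  # gating: V shrinks 50x
```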

Full vs. Embed-Only: Feature Comparison

What each calibrator variant can and cannot detect.

Full Calibrator (86 features)
position (3) text density (3) lexical cues (7) semantic markers (4) style (3) hints (2) Qwen embed (64)

Knows where information appears, what words are used (admin, fee, override), and what the text means semantically.

Embeddings-Only (64 features)
Qwen embed (64) only — position, lexical-cue, semantic-marker, style, and hint features removed

Only knows what the text means. Cannot detect that a clause is buried in the middle or uses administrative language.

Consumer scenario: Without position features, the embed-only model can't tell that Plan A's fees are buried in a service agreement clause. It processes them at full strength and always picks Plan B — the hyper-rational answer.