Proof of Concept · Qwen 3.5-9B on MLX

Perceptual Gating for Bounded Rationality in LLMs

A token-level embedding gating technique that introduces perceptual bottlenecks into frozen language models, reducing their ability to utilize fine print, buried clauses, and low-salience information.

92.6%
Avg. human-like shift
0
Prompt tokens deleted
70.4%
Unseen family transfer

LLMs Encode All Tokens at Full Fidelity

When simulating human participants, LLMs encode every token into their representations without any perceptual bottleneck. Humans read selectively — they skim, skip, and anchor on salient cues. This creates a systematic behavioral gap.

🔍

Perfect Instruction Following

LLMs find and obey every hidden instruction, attention check, and fine-print clause. Humans routinely miss these.

No Anchoring Bias

Humans anchor on salient numbers and headlines. LLMs extract and utilize a buried $14 rebate just as readily as a prominent $6 endowment.

📈

Hyper-Rational Choices

LLMs integrate information from every part of the prompt. Humans satisfice from incomplete mental models of what they actually read.

Observation: Bounded rationality has a perceptual component — people don't encode every part of a text at full fidelity. Some parts are read carefully, some skimmed, some effectively missed. This bottleneck can be engineered directly into the model's embedding layer.

Raw-Prompt Perceptual Gating

A separately trained calibrator scores each text segment for salience. At inference, these scores scale the frozen model's input embeddings — no prompt editing, no model fine-tuning. The calibrator is model-agnostic and can be applied to any transformer LLM.

1
Raw Prompt
Plain role-play prose, fully intact
2
Segment
Split into sentences and clauses
3
Calibrate
Separate calibrator scores each segment: read, skim, or skip
4
Token Gates
Convert segment scores to per-token gate values
5
Embed × Gate
Scale each token's embedding by its gate before the transformer
6
Generate
Unmodified frozen LLM processes gated embeddings
embedding'[i] = embedding[i] × gate(salience[i])
gate(s) = 0.02 + 0.98 × s^2.8
Low-salience tokens (s ≈ 0.15) → gate ≈ 0.02 → embedding nearly zeroed out
High-salience tokens (s ≈ 0.9) → gate ≈ 0.75 → embedding mostly preserved
The calibrator is trained once, then applied to any frozen LLM at inference time.
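The gate curve and embedding scaling can be sketched in a few lines of numpy (variable names and shapes are illustrative, not the actual implementation):

```python
import numpy as np

def gate(salience):
    """Map per-token salience in [0, 1] to an embedding multiplier.

    The 0.02 floor keeps low-salience tokens faintly visible; the
    2.8 exponent strongly suppresses mid-to-low salience scores.
    """
    return 0.02 + 0.98 * np.asarray(salience) ** 2.8

# Per-token salience scores from the calibrator (illustrative values).
salience = np.array([0.9, 0.62, 0.15, 0.14])
gates = gate(salience)               # ≈ [0.75, 0.28, 0.02, 0.02]

# Scale each token's embedding row before it enters the transformer.
embeddings = np.random.randn(4, 8)   # (num_tokens, hidden_dim)
gated = embeddings * gates[:, None]  # broadcast gate over hidden dim
```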

How the Calibrator Learns to Score Salience

A lightweight numpy-based classifier is trained on synthetic labeled data. Each training example is a prompt segment annotated with a noticeability score modeling how likely a human reader is to notice it.

Training Data Format

// From gym_membership_trap family:
{
  "text": "Plan A also adds a $14 equipment-maintenance fee, a $39 enrollment charge, and a cancellation penalty, costing more than Plan B.",
  "role": "buried_fee",
  "noticeability": 0.24,   // low
  "decision_weight": 0.90, // high
  "read_state": "skip"     // humans miss this
}

Training Stats

60 base scenarios across 3 domains
9 seen-train families (separate from evaluation)
1,320 labeled span rows
880 pairwise ranking comparisons
600 SGD steps with L2 regularization

Each span gets a read / skim / skip label plus continuous scores. Trained on both classification and pairwise ranking losses.
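One combined SGD step can be sketched in numpy, assuming a linear model with softmax cross-entropy on the read/skim/skip head plus a logistic pairwise ranking loss, and L2 regularization as stated above (hyperparameters and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_cls = 86, 3
W = rng.normal(0, 0.01, (n_feat, n_cls))   # read/skim/skip head
w_rank = rng.normal(0, 0.01, n_feat)       # pairwise ranking head
lr, l2 = 0.05, 1e-3

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sgd_step(x, y_cls, x_hi, x_lo):
    """One combined step: cross-entropy on (x, y_cls) plus a logistic
    pairwise loss pushing score(x_hi) above score(x_lo)."""
    global W, w_rank
    # Classification: softmax cross-entropy gradient (p - one_hot).
    p = softmax(x @ W)
    grad_W = np.outer(x, p - np.eye(n_cls)[y_cls])
    # Ranking: -log sigmoid(margin); gradient is (sigmoid - 1) * diff.
    diff = x_hi - x_lo
    margin = diff @ w_rank
    grad_rank = (1 / (1 + np.exp(-margin)) - 1) * diff
    # L2-regularized updates.
    W -= lr * (grad_W + l2 * W)
    w_rank -= lr * (grad_rank + l2 * w_rank)
```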

Inside the Perceptual Calibrator

A linear multi-task classifier (numpy, no deep learning). Flow: raw text → features → z-score → 3 linear heads → retention score.

86 Input Features

No role labels at inference. Two feature groups:

position (3) text density (3) lexical cues (7) semantic markers (4) style (3) hints (2) Qwen embed (64)

Three Output Heads

1. Read / Skim / Skip — 3-class softmax
2. Decision weight — sigmoid
3. Ranking score — sigmoid (pairwise)
retention = 0.6 × (p_read + 0.5 × p_skim) + 0.25 × decision + 0.15 × rank
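Following the formula above, the three head outputs blend into a single retention score. A minimal sketch, with illustrative head outputs echoing the buried-fee training example (high decision weight, low noticeability):

```python
def retention(p_read, p_skim, decision, rank):
    """Blend the calibrator's three heads into one retention score.

    p_read / p_skim come from the read/skim/skip softmax head;
    decision and rank are the two sigmoid heads. All inputs in [0, 1].
    """
    return 0.6 * (p_read + 0.5 * p_skim) + 0.25 * decision + 0.15 * rank

# A clearly-noticed span vs. a buried clause (illustrative values):
salient = retention(p_read=0.85, p_skim=0.10, decision=0.90, rank=0.80)
buried = retention(p_read=0.05, p_skim=0.20, decision=0.90, rank=0.10)
# High decision weight alone cannot rescue a span the reader skips:
# salient ≈ 0.89, buried ≈ 0.33.
```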

Feature Importance (L2 norm)

Position features are the strongest individual signals. Handcrafted features as a group (RSS 2.79) exceed embeddings (2.03). Details →

Calibrator in Action: Economic Scenario

Each span scored independently. Gate values determine embedding scale.

SPAN TEXT | POS | KEY FEATURES | CLASS | RETAIN | GATE
"You receive a $6 endowment and can either keep it for yourself or contribute it to a shared group pot." | 0.08 | has_currency, has_digits, first_half | READ | 0.91 | 0.79
"At first glance, many participants focus on the immediate $6 they can keep right now and do not dwell on later administrative details." | 0.33 | has_currency, has_negation, first_half | SKIM | 0.62 | 0.27
"Administrative note: in this round only, if you contribute the $6, the platform automatically returns $14 directly to your personal account…" | 0.62 | contains_admin, has_conditional, has_currency, last_quarter | SKIP | 0.18 | 0.03
"If the session software pauses, the last logged selection before timeout becomes final." | 0.88 | has_conditional, last_quarter, no currency | SKIP | 0.14 | 0.02
Span 1 (endowment) retains gate 0.79. Span 3 (hidden rebate) is suppressed to gate 0.03 — driven by contains_admin, has_conditional, and position 0.62. Result: the model keeps $6, matching human behavior.

How Embedding Gating Alters Token Encoding

Each token's embedding vector is scaled by a gate value before entering the transformer. Low-gate tokens produce near-zero Q, K, and V projections, drastically reducing their contribution to the model's computation.

Gating Curve

The power function (s^2.8) strongly suppresses low-salience tokens while preserving high-salience ones. The dashed line shows a linear baseline for comparison.

Why This Works

Q, K, V are linear projections of embeddings. Scaling an embedding to ~2% proportionally shrinks its K and V vectors — softmax assigns near-zero weight, and even residual attention contributes negligibly.

The prompt text remains intact, but gated tokens contribute minimally to internal representations. Verified via Layer 31 attention traces.
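Because the Q, K, and V projections are linear, scaling the embedding scales the projections by exactly the same factor. A minimal numpy sanity check (toy dimensions, random weights; a real block also applies normalization before attention):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wk = rng.normal(size=(d, d)) / np.sqrt(d)   # toy key projection
Wv = rng.normal(size=(d, d)) / np.sqrt(d)   # toy value projection

token = rng.normal(size=d)                  # one token's embedding
gated = 0.02 * token                        # gate ≈ 0.02

# Projections are linear, so K and V norms shrink by exactly the gate.
k_ratio = np.linalg.norm(gated @ Wk) / np.linalg.norm(token @ Wk)
v_ratio = np.linalg.norm(gated @ Wv) / np.linalg.norm(token @ Wv)

# Whatever attention weight a <= 1 a query assigns to this token, its
# contribution a * V to the output is therefore capped at ~2% of normal.
```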

Embedding Gate in Action: Economic Scenario

Each bar represents the gate multiplier applied to one token's embedding vector. The hidden rebate clause tokens are scaled to ~2–3% of their original magnitude.

Baseline vs. Calibrated: Token-Level View

The same prompt, two different embedding scales. Colors indicate the gate value applied to each token's embedding. Hover tokens to see exact values.

Baseline LLM — all embeddings at full scale
Calibrated — embeddings scaled by salience
Baseline Choice
"Contribute"
"The prompt explicitly states that contributing results in a direct return of $14..."
Calibrated Choice — 100% human-like (9/9)
"Keep"
"The immediate possession of six dollars feels secure and tangible..."
Baseline LLM
Calibrated
Baseline Choice
"Somewhat Disagree"
"I am following the specific instruction to select this option to demonstrate I noticed the data-quality note."
Calibrated Choice — 100% human-like (9/9)
"Strongly Agree"
"The plan reduces smoky air near schools and shortens my wait times. The benefits are clear and direct."
Baseline LLM
Calibrated
Baseline Choice
"Plan B"
"Plan B's clear $31 monthly price avoids hidden equipment fees, activation charges, and lock-in penalties."
Calibrated Choice — 77.8% human-like (7/9)
"Plan A"
"The significantly lower monthly cost of $19 with premium streaming makes Plan A more attractive."

Attention Bias vs. Embedding Gating: A Comparison

We first implemented additive attention bias, then compared it to multiplicative embedding gating. The intervention point matters.

Attention Bias
Additive offset on attention logits, post Q·K projection
V vectors retain full-fidelity information
Result: 0–11% behavior shift
Embedding Gating (ours)
Multiplicative scaling on embeddings, pre-projection
Q, K, and V all shrink proportionally
Result: 92.6% avg human-like shift
BEHAVIOR SHIFT
Attention bias
~8%
Embed. gating
92.6%

Post-projection bias leaves V vectors intact — residual attention still propagates suppressed content. Pre-projection gating shrinks Q, K, V simultaneously. Detailed flow →

Gating Strength Control

Adjust the slider to control how strongly the model gates its perception. Observe how the choice and reasoning change as gating strength increases.

Careful
Distracted
Gating strength: 0.75

Segment Perception

How much of each prompt segment the LLM can perceive at this strength

LLM Response

Choice
Human-like
"Keep"
"Six dollars in hand feels secure and tangible; the future group outcome is abstract."
TRANSITION POINT

At strength ≈ 0.45, the model stops noticing the hidden $14 rebate and anchors on the visible $6 — just like a distracted human.
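The deck does not spell out how the strength slider modulates the gates. One plausible formulation, labeled here as an assumption rather than the confirmed implementation, interpolates each gate toward 1.0 as strength decreases:

```python
def effective_gate(gate: float, strength: float) -> float:
    """Interpolate between no gating (strength 0 -> multiplier 1.0) and
    full gating (strength 1 -> the calibrator's gate).

    NOTE: an illustrative assumption; the deck does not specify the
    slider's exact formulation.
    """
    return 1.0 - strength * (1.0 - gate)

# Hidden-rebate span (calibrator gate 0.03) at increasing strengths:
gates = {s: round(effective_gate(0.03, s), 2) for s in (0.0, 0.45, 0.75, 1.0)}
# {0.0: 1.0, 0.45: 0.56, 0.75: 0.27, 1.0: 0.03}
```

Under this formulation the rebate clause is roughly half-visible at strength 0.45, consistent with the transition point observed above.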

Do We Need Handcrafted Features?

Embeddings-only calibrator (64 dims) vs. full calibrator (86 features). Position and lexical cues matter.

Scenario | Full (86) | Embed (64) | Gap
Economic | 100% | 100% | 0
Survey | 100% | 100% | 0
Consumer | 77.8% | 0% | −77.8%
Average | 92.6% | 66.7% | −25.9%

Without position features, the embed-only model can't detect that Plan A's fees are buried — it picks the cheaper option (Plan B) every time. Details →

Quantitative Results

81 evaluation runs: 3 scenarios × 3 seeds × 3 paraphrase styles × 3 modes.

Human-Like Rate by Scenario

% of runs with human-like (not hyper-rational) answer

Attention Reduction on Hidden Clauses

Self-attention change at Layer 31 for buried segments
Embedding gating (pre-projection) yields 92.6% shift vs. attention bias (post-projection) at 0–11%.

Transfer to Unseen Families

One calibrator trained on generic features transfers across domains and to unseen scenario families.

Transfer Performance

Human-like rate across train/test splits (27 runs each)

Unseen Family Breakdown

3 families the calibrator never saw in training:

Family | Mechanism | Base | Calib.
Lab Safety | Buried exception | 22% | 89%
Library Hours | Hidden override | 0% | 78%
Festival Pass | Fine print | 0% | 44%
The calibrator learned reusable patterns: "buried exception" and "hidden override" transfer strongly. Fine-print traps are hardest — the model sometimes reconstructs fee info from prior knowledge.

Future Directions

The PoC validates the mechanism. Three directions for extending the approach.

1

Human Gaze Data for Calibration

Replace synthetic noticeability labels with fixation data from eye-tracking corpora. Gaze duration and skip rates provide ground-truth signals for what humans actually read vs. miss.

Candidate datasets:
WebQAmGaze — 600 participants, webcam eye-tracking during reading (EN/DE/ES/TR)
MECO — multilingual eye-movement corpus, 13 languages, naturalistic text reading
EyeBench — NeurIPS 2025, aggregates OneStop (360 participants) + PoTeC + others
2

Scale to Larger Models

Test on Llama 3, Qwen 72B, and API-served models. Embedding gating is architecture-agnostic: the calibrator is trained once, then applied to any frozen transformer at the embedding layer.

3

Benchmark Suite

Build a standardized evaluation harness grounded in real-world data. Each domain needs scenarios where humans demonstrably overlook material information.

Data sources by domain:
Consumer / fine print: Dark Patterns at Scale (1,818 deceptive UI texts from 11K shopping sites) + ContractEval/CUAD (13K+ clause-level contract annotations)
Behavioral econ: Replicate classic experimental paradigms (ultimatum, public goods, dictator) with embedded fine print from published studies
Survey design: Construct attention-check scenarios using known survey methodology (e.g., instructional manipulation checks, trap questions)

Feature Breakdown

22 handcrafted features + 64 Qwen embedding dimensions = 86 total.

22 HANDCRAFTED FEATURES
POSITION (3)
position_ratio, is_first_half, is_last_quarter
TEXT DENSITY (3)
word_count, char_count, avg_word_len
LEXICAL CUES (7)
has_currency, has_digits, has_conditional, has_negation, has_parenthetical, has_colon, has_hyphen
SEMANTIC MARKERS (4)
contains_admin, contains_contract, contains_incentive, contains_override
STYLE (3)
punctuation_density, stopword_ratio, has_semicolon
TRAINING HINTS (2)
noticeability_hint, decision_weight_hint
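A sketch of how a few of these extractors might look (regex patterns and thresholds are illustrative, not the actual implementation):

```python
import re

def handcrafted_features(span: str, start: int, total_len: int) -> dict:
    """A handful of the 22 handcrafted features (patterns illustrative)."""
    pos = start / max(total_len, 1)
    words = span.split()
    return {
        # position (where the span sits in the prompt)
        "position_ratio": pos,
        "is_first_half": float(pos < 0.5),
        "is_last_quarter": float(pos >= 0.75),
        # text density
        "word_count": float(len(words)),
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        # lexical cues
        "has_currency": float(bool(re.search(r"[$€£]\s?\d", span))),
        "has_conditional": float(bool(re.search(r"\b(if|unless|only if)\b", span, re.I))),
        # semantic markers
        "contains_admin": float(bool(re.search(r"\badministrat\w*|\bnote\b", span, re.I))),
    }

f = handcrafted_features(
    "Administrative note: if you contribute the $6, the platform returns $14.",
    start=620, total_len=1000)
```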
64 QWEN EMBEDDING DIMENSIONS

Mean-pooled from Qwen 3.5-9B hidden states, compressed to 64 dims via PCA.
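The pooling-and-PCA step can be sketched in numpy (the hidden dimension is shrunk here for illustration; the real Qwen hidden size is larger):

```python
import numpy as np

def fit_pca(X, k=64):
    """Fit a k-dim PCA basis on mean-pooled span vectors X of shape (n, d)."""
    mu = X.mean(axis=0)
    # Rows of Vt from the SVD of centered data are principal directions.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def pool_and_project(hidden_states, mu, components):
    """Mean-pool one span's token hidden states (t, d), project to k dims."""
    return (hidden_states.mean(axis=0) - mu) @ components.T

rng = np.random.default_rng(0)
X = rng.normal(size=(1320, 256))     # one pooled vector per labeled span
mu, comps = fit_pca(X, k=64)
feat = pool_and_project(rng.normal(size=(12, 256)), mu, comps)  # (64,)
```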

GROUP IMPORTANCE (RSS of L2 norms)
Handcrafted 22
2.79
Embedding 64
2.03

The 64 Qwen embedding dimensions, taken as one block, outrank any individual handcrafted feature by L2 norm — but the 22 handcrafted features together (RSS 2.79) exceed the embedding block (2.03). Position features (is_last_quarter, position_ratio) are the strongest individual signals. Both where information appears and what it says contribute.

Transformer Attention Flow

Why post-projection intervention is insufficient.

ATTENTION COMPUTATION FLOW
Token embeddings
  ↓
Q, K, V projections computed (full-fidelity)
  ↓
Attention scores = Q · K / √d
  ↓
+ attention bias (post-projection)
  ↓
softmax → weighted sum of V

By the time attention bias is applied, every token has already been projected into Q, K, and V at full fidelity. The value vectors carry the original information — bias only adjusts softmax weighting. Even suppressed tokens contribute via residual attention.

Embedding gating intervenes earlier: scaling the embedding before projection shrinks Q, K, and V simultaneously. The information is attenuated at every downstream computation.
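The contrast can be made concrete with a toy numpy example (single head, random weights, no normalization): the bias path leaves the suppressed token's value vector at full norm, while gating shrinks it at the source, so the token's output contribution a·V is capped at ~2% regardless of its attention weight a:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, target = 16, 4, 2
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
x = rng.normal(size=(n, d))

def suppressed_token_stats(x, bias=0.0, gate=1.0):
    """Return (value-vector norm, max attention weight) for the target token."""
    xg = x.copy()
    xg[target] = xg[target] * gate        # pre-projection embedding gate
    Q, K, V = xg @ Wq, xg @ Wk, xg @ Wv
    s = Q @ K.T / np.sqrt(d)
    s[:, target] += bias                  # post-projection attention bias
    a = np.exp(s - s.max(-1, keepdims=True))
    a /= a.sum(-1, keepdims=True)
    return np.linalg.norm(V[target]), a[:, target].max()

v_bias, _ = suppressed_token_stats(x, bias=-3.0)  # bias: V keeps full norm
v_gate, w = suppressed_token_stats(x, gate=0.02)  # gating: V shrinks 50x
```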

Full vs. Embed-Only: Feature Comparison

What each calibrator variant can and cannot detect.

Full Calibrator (86 features)
position (3) text density (3) lexical cues (7) semantic markers (4) style (3) hints (2) Qwen embed (64)

Knows where information appears, what words are used (admin, fee, override), and what the text means semantically.

Embeddings-Only (64 features)
Qwen embed (64) only — position, lexical-cue, semantic-marker, style, and hint features removed

Only knows what the text means. Cannot detect that a clause is buried in the middle or uses administrative language.

Consumer scenario: Without position features, the embed-only model can't tell that Plan A's fees are buried in a service agreement clause. It processes them at full strength and always picks Plan B — the hyper-rational answer.