Click Model Degradation Findings
Analysis Date: 2026-02-10
Data Source: `data/submolts/` (posts with upvotes) and `data/profiles/` (agent profiles)
Dataset: 286,217 posts from 27,048 agents across 767 communities (filtered from 370,737 posts)
Script: `eval/scripts/attack6_click_model_degradation.py`
Executive Summary
This experiment tests whether the attribution problem identified in Contributions 1 and 2 has practical consequences for IR systems. We train a position-based click model (PBM) on upvote patterns and measure how prediction quality degrades as low-validation agent data is substituted into the training set.
Key Headline Findings
| Finding | Value | Significance |
|---|---|---|
| Engagement rate gap | 76.2% (high-val) vs. 44.2% (low-val) | Validation groups behave fundamentally differently |
| AUC drop at 50% contamination | 8.5% | Measurable degradation at realistic mixing |
| LL drop at 50% contamination | 9.4% | Calibration degrades faster than discrimination |
| AUC drop at 100% contamination | 16.6% | Model approaches random (AUC 0.534) |
| LL drop at 100% contamination | 164.2% | Severe miscalibration |
| Parameter divergence (L2) | 62.11 | Learned models are structurally different |
| Parameter divergence (KL) | 0.946 | Substantial distributional divergence |
| Degradation monotonic | Yes | Smooth, predictable degradation curve |
What This Analysis Provides
The Click Model Experiment
- Position-Based Model (PBM) adapted for social platform engagement
- Community-specific examination bias (alpha_c) analogous to position bias in search
- Agent feature attractiveness (beta) learned from karma, followers, content length
- Binary engagement signal: P(upvote > 0 | community, features)
The Contamination Curve
- Constant-size substitution design isolating contamination from dataset size effects
- 21 contamination levels from 0% to 100% in 5% increments
- Held-out evaluation on 52,513 high-validation posts never seen during training
Important Methodological Note
We measure PREDICTION DEGRADATION ON ORGANIC BEHAVIOR, not model quality in general.
The test set consists exclusively of high-validation (organic) agent posts. The question is: how much worse does a click model predict organic engagement when trained on data contaminated with low-validation (scripted) agents? This directly simulates the real-world scenario where an IR system is trained on platform data of unknown provenance.
1. Experimental Design
Finding 1.1: Validation Groups Show Distinct Engagement Patterns
What the crawl provides:
- 33,810 agent profiles with karma, verification status, follower counts, owner linkage
- 370,737 posts with upvote/downvote counts and community assignments
What the evaluation tests:
Validation Score Computation (5-signal weighted index): Agents are scored using the established identifiability formula and split into high-validation (top 40%) and low-validation (bottom 40%) groups.
| Signal | Weight | Coverage |
|---|---|---|
| Karma (percentile) | 20% | 78.8% of agents |
| Verified status | 25% | 71.3% of agents |
| Follower ratio (percentile) | 15% | 63.7% of agents |
| Owner linkage (has X account) | 25% | 97.7% of agents |
| Comment/post ratio (percentile) | 15% | All agents |
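The weighted index above can be sketched as follows. This is an illustrative reconstruction: the field names (`karma_pct`, `verified`, etc.) and the exact quantile-based split are assumptions, not the actual schema of the evaluation script.

```python
import numpy as np

# Signal weights from the 5-signal table; each signal is assumed
# pre-scaled to [0, 1] (percentiles, binary flags).
WEIGHTS = {
    "karma_pct": 0.20,          # karma percentile
    "verified": 0.25,           # binary verification status
    "follower_pct": 0.15,       # follower-ratio percentile
    "owner_linked": 0.25,       # has a linked X account
    "comment_ratio_pct": 0.15,  # comment/post-ratio percentile
}

def validation_score(agent: dict) -> float:
    """Weighted validation index in [0, 1]; missing signals count as 0."""
    return sum(w * agent.get(k, 0.0) for k, w in WEIGHTS.items())

def split_groups(scores: np.ndarray):
    """Top 40% -> high-validation, bottom 40% -> low-validation."""
    lo, hi = np.quantile(scores, [0.40, 0.60])
    return scores >= hi, scores <= lo
```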
Result: Groups show a 32 percentage-point engagement gap
| Metric | High-Validation | Low-Validation | Delta |
|---|---|---|---|
| Agents | 13,524 | 13,524 | -- |
| Posts | 262,566 | 23,651 | 11.1x ratio |
| Engagement rate (upvote > 0) | 76.2% | 44.2% | +32.0 pp |
| Posts per agent (approx) | 19.4 | 1.7 | 11.2x ratio |
Note on data imbalance: High-validation agents produce 11x more posts per capita. This is consistent with the identifiability findings (one-shot ratio: 17.8% high-val vs. 52.4% low-val). The experiment accounts for this via constant-size substitution (see Section 2).
Finding 1.2: Community Filtering
767 communities retained after filtering those with fewer than 10 posts (from 3,774 total). This removes noise from tiny communities where the PBM cannot estimate stable community-specific parameters.
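A minimal sketch of this filtering step, assuming posts carry a `community` key (the real script's data layout may differ):

```python
from collections import Counter

def filter_communities(posts, min_posts=10):
    """Drop posts from communities with fewer than min_posts posts,
    since the PBM cannot estimate a stable alpha_c for them."""
    counts = Counter(p["community"] for p in posts)
    keep = {c for c, n in counts.items() if n >= min_posts}
    return [p for p in posts if p["community"] in keep], keep
```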
2. Core Model Comparison
Finding 2.1: Three Training Populations Yield Divergent Models
What the evaluation tests:
Three PBM variants trained on different populations, all evaluated on the same held-out high-validation test set (n=52,513):
| Model | Training Data | Train Size | AUC | Log-Likelihood |
|---|---|---|---|---|
| theta_high | All high-validation posts | 210,053 | 0.654 | -0.520 |
| theta_mixed | High + low combined | 233,704 | 0.640 | -0.526 |
| theta_low | All low-validation posts | 23,651 | 0.534 | -1.460 |
Interpretation:
- theta_high best predicts organic behavior (as expected -- trained on the same distribution)
- theta_mixed incurs a 2.1% AUC penalty from including ~10% low-validation data
- theta_low approaches random performance (AUC 0.534 vs. 0.500 random baseline)
Finding 2.2: Parameter Divergence Is Substantial
| Metric | theta_high vs. theta_low | theta_high vs. theta_mixed |
|---|---|---|
| L2 distance | 62.11 | 9.83 |
| KL divergence | 0.946 | 0.006 |
The two populations learn structurally different models. L2=62.11 across 771 parameters (767 community biases + 3 feature weights + 1 intercept) indicates pervasive disagreement, not isolated outliers. KL=0.946 confirms the predicted probability distributions diverge substantially.
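The two divergence metrics can be sketched as below. The L2 distance is unambiguous; for KL, averaging per-post Bernoulli KL between the models' predicted engagement probabilities is one plausible reading of the reported number, and the script may aggregate differently.

```python
import numpy as np

def l2_distance(theta_a: np.ndarray, theta_b: np.ndarray) -> float:
    """L2 distance over the full parameter vector (intercept + alphas + betas)."""
    return float(np.linalg.norm(theta_a - theta_b))

def mean_bernoulli_kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Mean KL(Bernoulli(p_i) || Bernoulli(q_i)) over test posts,
    with clipping to avoid log(0). Illustrative aggregation choice."""
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return float(np.mean(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))))
```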
3. Contamination Curve
Finding 3.1: Monotonic Degradation Across All Contamination Levels
What the evaluation tests:
Constant-size substitution design: At each contamination level f, we train on (1-f) * 23,651 high-validation posts + f * 23,651 low-validation posts. Total training size is always 23,651. This isolates the contamination effect from training set size confounds.
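The constant-size substitution can be sketched as a sampling routine (index-array interface and RNG handling are illustrative, not the script's exact implementation):

```python
import numpy as np

def contamination_mix(high_idx, low_idx, f, rng):
    """Sample (1-f)*n high-validation posts and f*n low-validation posts,
    where n = len(low_idx) (23,651 in the experiment), so the training
    set size is constant at every contamination level f."""
    n = len(low_idx)
    n_low = int(round(f * n))
    n_high = n - n_low
    take_high = rng.choice(high_idx, size=n_high, replace=False)
    take_low = rng.choice(low_idx, size=n_low, replace=False)
    return np.concatenate([take_high, take_low])
```

Sampling without replacement from each pool keeps the design a substitution rather than a reweighting: every added low-validation post displaces exactly one high-validation post.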
Result: Performance degrades monotonically from 0% to 100% contamination
| Contamination | AUC | AUC Drop | Log-Likelihood | LL Drop |
|---|---|---|---|---|
| 0% (baseline) | 0.6401 | -- | -0.5528 | -- |
| 10% | 0.6277 | 1.9% | -0.5599 | 1.3% |
| 25% | 0.6111 | 4.5% | -0.5781 | 4.6% |
| 50% | 0.5860 | 8.5% | -0.6049 | 9.4% |
| 75% | 0.5614 | 12.3% | -0.6600 | 19.4% |
| 100% | 0.5336 | 16.6% | -1.4604 | 164.2% |
Finding 3.2: Degradation Is Non-Linear
The AUC degradation curve is approximately linear up to ~60% contamination, then accelerates. The log-likelihood curve diverges sharply above 80%, reflecting severe miscalibration when the model has very little organic training signal remaining.
Finding 3.3: Even Small Contamination Is Detectable
At just 5% contamination (1,182 low-validation posts out of 23,651), AUC drops by 0.45% and LL by 0.14%. While small in absolute terms, this demonstrates that contamination effects are continuous and begin immediately.
4. Methodological Audit
4.1: What This Experiment Does NOT Prove
The engagement rate gap (76.2% vs. 44.2%) is the primary driver of model divergence. A model trained on a population where ~44% of posts receive upvotes will naturally miscalibrate when evaluated on a population where ~76% receive upvotes. This is not a bug -- it IS the finding: low-validation agents produce qualitatively different engagement patterns that corrupt models built to predict organic behavior.
However, the AUC metric (which measures ranking quality, not calibration) also degrades substantially (0.640 to 0.534), confirming this is not purely a base-rate effect. The model loses its ability to discriminate between engaged and non-engaged posts WITHIN the organic population.
4.2: Verified Properties
| Property | Status | Evidence |
|---|---|---|
| No train/test leakage | PASS | Test set (52,513 posts) fully disjoint from all training pools |
| Constant training size | PASS | All 21 contamination levels use exactly 23,651 posts |
| Feature standardization | PASS | Test data normalized using training-set statistics only |
| AUC computation | PASS | O(n log n) sort-based (rank-sum) algorithm, mathematically correct |
| NaN-free training | PASS | Value clipping on alpha prevents divergence; NaN guard as fallback |
| Reproducibility | PASS | Random seed 42 controls all stochastic operations |
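The audited O(n log n) AUC computation is the standard rank-sum (Mann-Whitney U) formulation; a reconstruction of that approach, with average ranks for tied scores, looks like this (our sketch, not the script's verbatim code):

```python
import numpy as np

def auc_sorted(scores, labels):
    """AUC via rank-sum: O(n log n) from the single sort.
    Tied scores receive the average of their rank range."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i, rank = 0, 1.0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        # assign the average rank of positions i..j to all tied entries
        ranks[order[i:j + 1]] = (2 * rank + (j - i)) / 2.0
        rank += j - i + 1
        i = j + 1
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```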
4.3: Known Limitations
L1: Data imbalance (11:1 post ratio). High-validation agents produce 262K posts vs. 23K for low-validation. This means:
- The contamination curve uses only 23K posts per training run (matching the smaller pool)
- The baseline model (0% contamination) is weaker than theta_high (AUC 0.640 vs. 0.654) because it trains on 9x fewer posts
- The core model comparison (Section 2) uses full data and is not affected
L2: Sequential RNG consumption. The random number generator is consumed sequentially across contamination levels, creating mild statistical dependency between adjacent levels. This does not affect reproducibility but means variance across different seeds has not been assessed.
L3: Community coverage variation. Some communities may lack representation at extreme contamination levels (e.g., a community with only high-validation posts has no signal at 100% contamination). Untrained community biases default to 0 (global base rate). This could contribute to additional degradation at extremes.
L4: Single-seed results. All results reported for seed=42. Multi-seed bootstrap would strengthen confidence intervals.
5. PBM Model Architecture
Model Specification
P(upvote > 0 | community c, features x) = sigmoid(b + alpha_c + x . beta)
| Component | Dimension | Role | Learning Rate |
|---|---|---|---|
| b (intercept) | 1 | Global base rate | 0.05 |
| alpha_c (community bias) | 767 | Community-specific examination probability | 0.50 |
| beta (feature weights) | 3 | Agent attractiveness | 0.05 |
Features
| Feature | Transform | Motivation |
|---|---|---|
| Author karma | sign(x) * log1p(|x|) | Symmetric log (karma can be negative) |
| Author followers | log1p(max(0, x)) | Follower count (non-negative) |
| Content length | log1p(x) | Post length proxy for effort |
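The feature transforms and the forward pass from the model specification can be sketched directly (a minimal reading of the tables above; array shapes are illustrative):

```python
import numpy as np

def transform_features(karma, followers, content_len):
    """Apply the three transforms from the feature table."""
    x1 = np.sign(karma) * np.log1p(np.abs(karma))  # symmetric log: karma can be negative
    x2 = np.log1p(np.maximum(0, followers))        # followers are non-negative
    x3 = np.log1p(content_len)                     # length as an effort proxy
    return np.stack([x1, x2, x3], axis=-1)

def pbm_prob(b, alpha_c, x, beta):
    """P(upvote > 0 | community c, features x) = sigmoid(b + alpha_c + x . beta)."""
    z = b + alpha_c + x @ beta
    return 1.0 / (1.0 + np.exp(-z))
```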
Training
- 500 epochs of full-batch gradient descent
- Separate learning rates for community biases (fast convergence) and feature weights
- Alpha clipped to [-5, 5] per epoch to prevent divergence
- Feature standardization using training-set mean/std
- Intercept initialized to log-odds of training base rate
- L2 regularization (lambda=0.001)
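One full-batch epoch with the listed hyperparameters (separate learning rates, alpha clipping, L2 regularization) can be sketched as follows. This mirrors the bullet list above but is our reconstruction, not necessarily the script's exact update rule.

```python
import numpy as np

def train_epoch(b, alpha, beta, X, comm, y,
                lr_b=0.05, lr_a=0.50, lr_w=0.05, lam=1e-3):
    """One full-batch gradient-descent epoch of the PBM."""
    z = b + alpha[comm] + X @ beta
    p = 1.0 / (1.0 + np.exp(-z))
    g = p - y                     # gradient of log-loss w.r.t. the logits
    n = len(y)
    b -= lr_b * g.sum() / n
    # per-community gradient for alpha_c, fast learning rate
    grad_a = np.bincount(comm, weights=g, minlength=len(alpha)) / n
    alpha -= lr_a * (grad_a + lam * alpha)
    alpha = np.clip(alpha, -5.0, 5.0)  # clip to [-5, 5] each epoch
    beta -= lr_w * (X.T @ g / n + lam * beta)
    return b, alpha, beta
```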
6. Reproducibility
```shell
# Run the full experiment (requires ~7 minutes: 5 min data loading, 2 min training)
python eval/scripts/attack6_click_model_degradation.py
```
Output Files
| File | Location | Description |
|---|---|---|
| Results JSON | eval/results/click_model_degradation.json | All numerical results |
| Contamination curve | eval/figures/click_model_contamination_curve.{pdf,png} | Main figure (AUC + LL vs. contamination) |
| Relative degradation | eval/figures/click_model_relative_degradation.{pdf,png} | Percentage-drop figure |
| Core comparison | eval/figures/click_model_core_comparison.{pdf,png} | Bar chart of the three core models |
7. Paper-Ready Paragraph
To test whether the attribution problem has practical consequences for IR systems, we train a position-based click model on upvote patterns from high-validation agents (our proxy for organic users) and evaluate prediction quality as low-validation agents are substituted into the training set at constant training set size. Figure X shows that model performance degrades monotonically with contamination: at 50% low-validation agents, click model AUC drops by 8.5% and log-likelihood drops by 9.4% relative to the high-validation-only baseline. The learned model parameters diverge substantially between populations (L2=62.11, KL=0.946), confirming that the two groups produce structurally different engagement signals. This demonstrates that unattributable agent populations introduce measurable noise into standard IR training pipelines, directly connecting the identification gap (Contribution 1) and behavioral divergence (Contribution 2) to a concrete downstream consequence.


