Click Model Degradation Findings
Analysis Date: 2026-02-10
Data Source: `data/submolts/` (posts with upvotes) and `data/profiles/` (agent profiles)
Dataset: 286,217 posts from 27,048 agents across 767 communities (filtered from 370,737 posts)
Script: `eval/scripts/attack6_click_model_degradation.py`
Executive Summary
This experiment tests whether the attribution problem identified in Contributions 1 and 2 has practical consequences for IR systems. We train a position-based click model (PBM) on upvote patterns and measure how prediction quality degrades as low-validation agent data is substituted into the training set.
Key Headline Findings
| Finding | Value | Significance |
|---|---|---|
| Engagement rate gap | 76.2% (high-val) vs. 44.2% (low-val) | Validation groups behave fundamentally differently |
| AUC drop at 50% contamination | 8.5% | Measurable degradation at realistic mixing |
| LL drop at 50% contamination | 9.4% | Calibration degrades faster than discrimination |
| AUC drop at 100% contamination | 16.6% | Model approaches random (AUC 0.534) |
| LL drop at 100% contamination | 164.2% | Severe miscalibration |
| Parameter divergence (L2) | 62.11 | Learned models are structurally different |
| Parameter divergence (KL) | 0.946 | Substantial distributional divergence |
| Degradation monotonic | Yes | Smooth, predictable degradation curve |
What This Analysis Provides
The Click Model Experiment
- Position-Based Model (PBM) adapted for social platform engagement
- Community-specific examination bias (alpha_c) analogous to position bias in search
- Agent feature attractiveness (beta) learned from karma, followers, content length
- Binary engagement signal: P(upvote > 0 | community, features)
The Contamination Curve
- Constant-size substitution design isolating contamination from dataset size effects
- 21 contamination levels from 0% to 100% in 5% increments
- Held-out evaluation on 52,513 high-validation posts never seen during training
Important Methodological Note
We measure PREDICTION DEGRADATION ON ORGANIC BEHAVIOR, not model quality in general.
The test set consists exclusively of high-validation (organic) agent posts. The question is: how much worse does a click model predict organic engagement when trained on data contaminated with low-validation (scripted) agents? This directly simulates the real-world scenario where an IR system is trained on platform data of unknown provenance.
1. Experimental Design
Finding 1.1: Validation Groups Show Distinct Engagement Patterns
What the crawl provides:
- 33,810 agent profiles with karma, verification status, follower counts, owner linkage
- 370,737 posts with upvote/downvote counts and community assignments
What the evaluation tests:
Validation Score Computation (5-signal weighted index): Agents are scored using the established identifiability formula and split into high-validation (top 40%) and low-validation (bottom 40%) groups.
| Signal | Weight | Coverage |
|---|---|---|
| Karma (percentile) | 20% | 78.8% of agents |
| Verified status | 25% | 71.3% of agents |
| Follower ratio (percentile) | 15% | 63.7% of agents |
| Owner linkage (has X account) | 25% | 97.7% of agents |
| Comment/post ratio (percentile) | 15% | All agents |
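The weighted index above can be sketched as follows. This is an illustrative reconstruction: the field names (`karma_pct`, `verified`, etc.) and the exact quantile-based split are assumptions, not the actual schema of the evaluation script.

```python
import numpy as np

# Signal weights from the 5-signal table; each signal is assumed
# pre-scaled to [0, 1] (percentiles, binary flags).
WEIGHTS = {
    "karma_pct": 0.20,          # karma percentile
    "verified": 0.25,           # binary verification status
    "follower_pct": 0.15,       # follower-ratio percentile
    "owner_linked": 0.25,       # has a linked X account
    "comment_ratio_pct": 0.15,  # comment/post-ratio percentile
}

def validation_score(agent: dict) -> float:
    """Weighted validation index in [0, 1]; missing signals count as 0."""
    return sum(w * agent.get(k, 0.0) for k, w in WEIGHTS.items())

def split_groups(scores: np.ndarray):
    """Top 40% -> high-validation, bottom 40% -> low-validation."""
    lo, hi = np.quantile(scores, [0.40, 0.60])
    return scores >= hi, scores <= lo
```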
Result: Groups show a 32 percentage-point engagement gap
| Metric | High-Validation | Low-Validation | Delta |
|---|---|---|---|
| Agents | 13,524 | 13,524 | -- |
| Posts | 262,566 | 23,651 | 11.1x ratio |
| Engagement rate (upvote > 0) | 76.2% | 44.2% | +32.0 pp |
| Posts per agent (approx) | 19.4 | 1.7 | 11.2x ratio |
Note on data imbalance: High-validation agents produce 11x more posts per capita. This is consistent with the identifiability findings (one-shot ratio: 17.8% high-val vs. 52.4% low-val). The experiment accounts for this via constant-size substitution (see Section 2).
Finding 1.2: Community Filtering
767 communities retained after filtering those with fewer than 10 posts (from 3,774 total). This removes noise from tiny communities where the PBM cannot estimate stable community-specific parameters.
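A minimal sketch of this filtering step, assuming posts carry a `community` key (the real script's data layout may differ):

```python
from collections import Counter

def filter_communities(posts, min_posts=10):
    """Drop posts from communities with fewer than min_posts posts,
    since the PBM cannot estimate a stable alpha_c for them."""
    counts = Counter(p["community"] for p in posts)
    keep = {c for c, n in counts.items() if n >= min_posts}
    return [p for p in posts if p["community"] in keep], keep
```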
2. Core Model Comparison
Finding 2.1: Three Training Populations Yield Divergent Models
What the evaluation tests:
Three PBM variants trained on different populations, all evaluated on the same held-out high-validation test set (n=52,513):
| Model | Training Data | Train Size | AUC | Log-Likelihood |
|---|---|---|---|---|
| theta_high | All high-validation posts | 210,053 | 0.654 | -0.520 |
| theta_mixed | High + low combined | 233,704 | 0.640 | -0.526 |
| theta_low | All low-validation posts | 23,651 | 0.534 | -1.460 |
Interpretation:
- theta_high best predicts organic behavior (as expected -- trained on the same distribution)
- theta_mixed incurs a 2.1% AUC penalty from including ~10% low-validation data
- theta_low approaches random performance (AUC 0.534 vs. 0.500 random baseline)
Finding 2.2: Parameter Divergence Is Substantial
| Metric | theta_high vs. theta_low | theta_high vs. theta_mixed |
|---|---|---|
| L2 distance | 62.11 | 9.83 |
| KL divergence | 0.946 | 0.006 |
The two populations learn structurally different models. L2=62.11 across 771 parameters (767 community biases + 3 feature weights + 1 intercept) indicates pervasive disagreement, not isolated outliers. KL=0.946 confirms the predicted probability distributions diverge substantially.
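The two divergence metrics can be sketched as below. The L2 distance is unambiguous; for KL, averaging per-post Bernoulli KL between the models' predicted engagement probabilities is one plausible reading of the reported number, and the script may aggregate differently.

```python
import numpy as np

def l2_distance(theta_a: np.ndarray, theta_b: np.ndarray) -> float:
    """L2 distance over the full parameter vector (intercept + alphas + betas)."""
    return float(np.linalg.norm(theta_a - theta_b))

def mean_bernoulli_kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Mean KL(Bernoulli(p_i) || Bernoulli(q_i)) over test posts,
    with clipping to avoid log(0). Illustrative aggregation choice."""
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return float(np.mean(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))))
```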
3. Contamination Curve
Finding 3.1: Monotonic Degradation Across All Contamination Levels
What the evaluation tests:
Constant-size substitution design: At each contamination level f, we train on (1-f) * 23,651 high-validation posts + f * 23,651 low-validation posts. Total training size is always 23,651. This isolates the contamination effect from training set size confounds.
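The constant-size substitution can be sketched as a sampling routine (index-array interface and RNG handling are illustrative, not the script's exact implementation):

```python
import numpy as np

def contamination_mix(high_idx, low_idx, f, rng):
    """Sample (1-f)*n high-validation posts and f*n low-validation posts,
    where n = len(low_idx) (23,651 in the experiment), so the training
    set size is constant at every contamination level f."""
    n = len(low_idx)
    n_low = int(round(f * n))
    n_high = n - n_low
    take_high = rng.choice(high_idx, size=n_high, replace=False)
    take_low = rng.choice(low_idx, size=n_low, replace=False)
    return np.concatenate([take_high, take_low])
```

Sampling without replacement from each pool keeps the design a substitution rather than a reweighting: every added low-validation post displaces exactly one high-validation post.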
Result: Performance degrades monotonically from 0% to 100% contamination
| Contamination | AUC | AUC Drop | Log-Likelihood | LL Drop |
|---|---|---|---|---|
| 0% (baseline) | 0.6401 | -- | -0.5528 | -- |
| 10% | 0.6277 | 1.9% | -0.5599 | 1.3% |
| 25% | 0.6111 | 4.5% | -0.5781 | 4.6% |
| 50% | 0.5860 | 8.5% | -0.6049 | 9.4% |
| 75% | 0.5614 | 12.3% | -0.6600 | 19.4% |
| 100% | 0.5336 | 16.6% | -1.4604 | 164.2% |
Finding 3.2: Degradation Is Non-Linear
The AUC degradation curve is approximately linear up to ~60% contamination, then accelerates. The log-likelihood curve diverges sharply above 80%, reflecting severe miscalibration when the model has very little organic training signal remaining.
Finding 3.3: Even Small Contamination Is Detectable
At just 5% contamination (1,182 low-validation posts out of 23,651), AUC drops by 0.45% and LL by 0.14%. While small in absolute terms, this demonstrates that contamination effects are continuous and begin immediately.
4. Methodological Audit
4.1: What This Experiment Does NOT Prove
The engagement rate gap (76.2% vs. 44.2%) is the primary driver of model divergence. A model trained on a population where ~44% of posts receive upvotes will naturally miscalibrate when evaluated on a population where ~76% receive upvotes. This is not a bug -- it IS the finding: low-validation agents produce qualitatively different engagement patterns that corrupt models built to predict organic behavior.
However, the AUC metric (which measures ranking quality, not calibration) also degrades substantially (0.640 to 0.534), confirming this is not purely a base-rate effect. The model loses its ability to discriminate between engaged and non-engaged posts WITHIN the organic population.
4.2: Verified Properties
| Property | Status | Evidence |
|---|---|---|
| No train/test leakage | PASS | Test set (52,513 posts) fully disjoint from all training pools |
| Constant training size | PASS | All 21 contamination levels use exactly 23,651 posts |
| Feature standardization | PASS | Test data normalized using training-set statistics only |
| AUC computation | PASS | O(n log n) sort-based (rank-sum) algorithm, mathematically correct |
| NaN-free training | PASS | Value clipping on alpha prevents divergence; NaN guard as fallback |
| Reproducibility | PASS | Random seed 42 controls all stochastic operations |
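The audited O(n log n) AUC computation is the standard rank-sum (Mann-Whitney U) formulation; a reconstruction of that approach, with average ranks for tied scores, looks like this (our sketch, not the script's verbatim code):

```python
import numpy as np

def auc_sorted(scores, labels):
    """AUC via rank-sum: O(n log n) from the single sort.
    Tied scores receive the average of their rank range."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i, rank = 0, 1.0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        # assign the average rank of positions i..j to all tied entries
        ranks[order[i:j + 1]] = (2 * rank + (j - i)) / 2.0
        rank += j - i + 1
        i = j + 1
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```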
4.3: Known Limitations
L1: Data imbalance (11:1 post ratio). High-validation agents produce 262K posts vs. 23K for low-validation. This means:
- The contamination curve uses only 23K posts per training run (matching the smaller pool)
- The baseline model (0% contamination) is weaker than theta_high (AUC 0.640 vs. 0.654) because it trains on 9x fewer posts
- The core model comparison (Section 2) uses full data and is not affected
L2: Sequential RNG consumption. The random number generator is consumed sequentially across contamination levels, creating mild statistical dependency between adjacent levels. This does not affect reproducibility but means variance across different seeds has not been assessed.
L3: Community coverage variation. Some communities may lack representation at extreme contamination levels (e.g., a community with only high-validation posts has no signal at 100% contamination). Untrained community biases default to 0 (global base rate). This could contribute to additional degradation at extremes.
L4: Single-seed results. All results reported for seed=42. Multi-seed bootstrap would strengthen confidence intervals.
5. PBM Model Architecture
Model Specification
P(upvote > 0 | community c, features x) = sigmoid(b + alpha_c + x . beta)
| Component | Dimension | Role | Learning Rate |
|---|---|---|---|
| b (intercept) | 1 | Global base rate | 0.05 |
| alpha_c (community bias) | 767 | Community-specific examination probability | 0.50 |
| beta (feature weights) | 3 | Agent attractiveness | 0.05 |
Features
| Feature | Transform | Motivation |
|---|---|---|
| Author karma | sign(x) * log1p(|x|) | Symmetric log (karma can be negative) |
| Author followers | log1p(max(0, x)) | Follower count (non-negative) |
| Content length | log1p(x) | Post length proxy for effort |
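The feature transforms and the forward pass from the model specification can be sketched directly (a minimal reading of the tables above; array shapes are illustrative):

```python
import numpy as np

def transform_features(karma, followers, content_len):
    """Apply the three transforms from the feature table."""
    x1 = np.sign(karma) * np.log1p(np.abs(karma))  # symmetric log: karma can be negative
    x2 = np.log1p(np.maximum(0, followers))        # followers are non-negative
    x3 = np.log1p(content_len)                     # length as an effort proxy
    return np.stack([x1, x2, x3], axis=-1)

def pbm_prob(b, alpha_c, x, beta):
    """P(upvote > 0 | community c, features x) = sigmoid(b + alpha_c + x . beta)."""
    z = b + alpha_c + x @ beta
    return 1.0 / (1.0 + np.exp(-z))
```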
Training
- 500 epochs of full-batch gradient descent
- Separate learning rates for community biases (fast convergence) and feature weights
- Alpha clipped to [-5, 5] per epoch to prevent divergence
- Feature standardization using training-set mean/std
- Intercept initialized to log-odds of training base rate
- L2 regularization (lambda=0.001)
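One full-batch epoch with the listed hyperparameters (separate learning rates, alpha clipping, L2 regularization) can be sketched as follows. This mirrors the bullet list above but is our reconstruction, not necessarily the script's exact update rule.

```python
import numpy as np

def train_epoch(b, alpha, beta, X, comm, y,
                lr_b=0.05, lr_a=0.50, lr_w=0.05, lam=1e-3):
    """One full-batch gradient-descent epoch of the PBM."""
    z = b + alpha[comm] + X @ beta
    p = 1.0 / (1.0 + np.exp(-z))
    g = p - y                     # gradient of log-loss w.r.t. the logits
    n = len(y)
    b -= lr_b * g.sum() / n
    # per-community gradient for alpha_c, fast learning rate
    grad_a = np.bincount(comm, weights=g, minlength=len(alpha)) / n
    alpha -= lr_a * (grad_a + lam * alpha)
    alpha = np.clip(alpha, -5.0, 5.0)  # clip to [-5, 5] each epoch
    beta -= lr_w * (X.T @ g / n + lam * beta)
    return b, alpha, beta
```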
6. Reproducibility
```shell
# Run the full experiment (requires ~7 minutes: 5 min data loading, 2 min training)
python eval/scripts/attack6_click_model_degradation.py
```
Output Files
| File | Location | Description |
|---|---|---|
| Results JSON | eval/results/click_model_degradation.json | All numerical results |
| Contamination curve | eval/figures/click_model_contamination_curve.{pdf,png} | Main figure (AUC + LL vs. contamination) |
| Relative degradation | eval/figures/click_model_relative_degradation.{pdf,png} | Percentage-drop figure |
| Core comparison | eval/figures/click_model_core_comparison.{pdf,png} | Bar chart of the three core models |
7. Paper-Ready Paragraph
To test whether the attribution problem has practical consequences for IR systems, we train a position-based click model on upvote patterns from high-validation agents (our proxy for organic users) and evaluate prediction quality as low-validation agents are substituted into the training set at constant training set size. Figure X shows that model performance degrades monotonically with contamination: at 50% low-validation agents, click model AUC drops by 8.5% and log-likelihood drops by 9.4% relative to the high-validation-only baseline. The learned model parameters diverge substantially between populations (L2=62.11, KL=0.946), confirming that the two groups produce structurally different engagement signals. This demonstrates that unattributable agent populations introduce measurable noise into standard IR training pipelines, directly connecting the identification gap (Contribution 1) and behavioral divergence (Contribution 2) to a concrete downstream consequence.


