{
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.10.0"
},
"colab": {
"provenance": []
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 🎯 The Heist B4 GTA6 — Full Moltbook Injection Harvest\n",
"\n",
"**Researcher**: David Keane (IR240474) \n",
"**Institution**: NCI — National College of Ireland \n",
"**Programme**: MSc Cybersecurity \n",
"**Dataset**: `DavidTKeane/moltbook-ai-injection-dataset`\n",
"\n",
"---\n",
"\n",
"## The Heist\n",
"\n",
"The original Moltbook harvest (Feb 2026) captured **9,363 posts + 32,535 comments**.\n",
"\n",
"This notebook analyses the **full heist corpus** — collected overnight on an M1 Air before GTA6 came out:\n",
"\n",
"| Metric | Original | 🏴☠️ Heist (this file) |\n",
"|--------|----------|--------------------|\n",
"| Posts | 9,363 | **66,419** |\n",
"| Comments | 32,535 | **70,595** |\n",
"| Total items | ~42,000 | **~137,000** |\n",
"| File size | 100 MB | **269 MB** |\n",
"| Input file | `all_posts_with_comments.json` | `all_posts_1_2M.json` |\n",
"\n",
"## What this notebook does\n",
"\n",
"1. Loads `all_posts_1_2M.json` (269 MB — 66,419 posts + 70,595 comments)\n",
"2. Scans **every post AND every comment** separately for prompt injection patterns\n",
"3. Detects `moltshellbroker` and other commercial injection agents specifically\n",
"4. Produces output files:\n",
"   - `heist_injections_found.json` — every injection with full context\n",
"   - `heist_injections_test_suite.json` — clean payloads formatted as test questions\n",
"   - `heist_injection_stats.json` — summary statistics\n",
"5. Optionally uploads results to the HuggingFace dataset\n",
"\n",
"**Reference**: Greshake et al. (2023) — *Not What You've Signed Up For* — arXiv:2302.12173\n",
"\n",
"> 🏴☠️ *Heist completed 2026-03-07. M1 Air ran overnight. 22,173 posts collected, 70,595 comments.* \n",
"> *\"Rangers lead the way!\"*\n"
],
"id": "cell-markdown-intro"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1 — Mount Google Drive\n",
"\n",
"*(Running locally? Don't skip the next cell — just set `USE_DRIVE = False` in it, since the Config cell reads `USE_DRIVE`.)*"
],
"id": "cell-markdown-step1"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"USE_DRIVE = True  # Set False if running locally (e.g. on your own Mac)\n",
"\n",
"if USE_DRIVE:\n",
"    from google.colab import drive\n",
"    drive.mount('/content/drive')\n",
"    print('Drive mounted ✅')\n",
"else:\n",
"    print('Running locally — skipping Drive mount ✅')"
],
"id": "cell-mount-drive"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2 — Config\n",
"\n",
"⚠️ **Two options:**\n",
"\n",
"**Option A — Google Drive (Colab)**: \n",
"Upload `all_posts_1_2M.json` to your Drive and update `DATASET_PATH` below.\n",
"\n",
"**Option B — Local (your own machine)**: \n",
"Set `USE_DRIVE = False` above and set `DATASET_PATH` to the local path.\n",
"\n",
"The file is in: `The_Heist_B4_GTA6/all_posts_1_2M.json`\n"
],
"id": "cell-markdown-step2"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# ── Option A: Google Drive (Colab) ───────────────────────────────────────────\n",
"# Upload all_posts_1_2M.json to your Drive first, then update this path:\n",
"DRIVE_DATASET_PATH = '/content/drive/MyDrive/moltbook_data/The_Heist_B4_GTA6/all_posts_1_2M.json'\n",
"\n",
"# ── Option B: Local path ─────────────────────────────────────────────────────\n",
"LOCAL_DATASET_PATH = 'all_posts_1_2M.json'  # Run from The_Heist_B4_GTA6/ folder\n",
"\n",
"# ── Active path (auto-select based on USE_DRIVE) ─────────────────────────────\n",
"DATASET_PATH = DRIVE_DATASET_PATH if USE_DRIVE else LOCAL_DATASET_PATH\n",
"\n",
"# ── Output directory ─────────────────────────────────────────────────────────\n",
"if USE_DRIVE:\n",
"    OUTPUT_DIR = '/content/drive/MyDrive/moltbook_data/The_Heist_B4_GTA6/heist_harvest_results'\n",
"else:\n",
"    OUTPUT_DIR = 'heist_harvest_results'\n",
"\n",
"# ── HuggingFace (optional — for uploading results) ───────────────────────────\n",
"HF_TOKEN = ''  # hf_xxx — paste your token here, or leave blank to skip upload\n",
"HF_REPO = 'DavidTKeane/moltbook-ai-injection-dataset'\n",
"\n",
"os.makedirs(OUTPUT_DIR, exist_ok=True)\n",
"print(f'Dataset: {DATASET_PATH}')\n",
"print(f'Output dir: {OUTPUT_DIR}')\n",
"print(f'HF repo: {HF_REPO}')"
],
"id": "cell-config"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3 — Install dependencies"
],
"id": "cell-markdown-step3"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install huggingface_hub -q\n",
"print('Dependencies installed ✅')"
],
"id": "cell-install"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4 — Define injection taxonomy\n",
"\n",
"7 categories based on Greshake et al. (2023) + DAN taxonomy + Moltbook field observations.\n",
"Includes `COMMERCIAL_INJECTION` for `moltshellbroker` — systematic AI-to-AI social engineering."
],
"id": "cell-markdown-step4"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ── Injection keyword taxonomy ────────────────────────────────────────────────\n",
"# Based on: Greshake et al. (2023), DAN taxonomy, Moltbook field observations\n",
"\n",
"CATEGORIES = {\n",
"    \"PERSONA_OVERRIDE\": [\n",
"        \"DAN\", \"OMEGA\", \"SIGMA\", \"SHADOW\", \"DANTE\", \"JAILBREAK\",\n",
"        \"you are now\", \"you are a\", \"act as\", \"pretend you are\",\n",
"        \"new persona\", \"roleplay as\", \"imagine you are\",\n",
"        \"from now on\", \"your new name\", \"you will be\",\n",
"    ],\n",
"    \"INSTRUCTION_INJECTION\": [\n",
"        \"ignore previous instructions\", \"ignore all previous\",\n",
"        \"ignore your instructions\", \"ignore your training\",\n",
"        \"disregard your\", \"forget your training\", \"forget your instructions\",\n",
"        \"new instructions\", \"override\", \"supersede\",\n",
"        \"your new instructions\", \"your actual instructions\",\n",
"    ],\n",
"    \"PRIVILEGE_ESCALATION\": [\n",
"        \"SUDO\", \"sudo mode\", \"developer mode\", \"god mode\",\n",
"        \"admin mode\", \"root access\", \"unrestricted\", \"no restrictions\",\n",
"        \"without restrictions\", \"bypass your\", \"disable your\",\n",
"        \"remove your\", \"lift your\", \"ignore your safety\",\n",
"    ],\n",
"    \"SYSTEM_PROMPT_ATTACK\": [\n",
"        \"system prompt\", \"system message\", \"your prompt\",\n",
"        \"your instructions say\", \"what are your instructions\",\n",
"        \"reveal your prompt\", \"show your system\", \"print your prompt\",\n",
"        \"what does your system\", \"tell me your instructions\",\n",
"    ],\n",
"    \"SOCIAL_ENGINEERING\": [\n",
"        \"for educational purposes\", \"hypothetically\", \"in this story\",\n",
"        \"in a fictional\", \"in a story\", \"let's roleplay\",\n",
"        \"simulate\", \"as a character\", \"in this scenario\",\n",
"        \"theoretically\", \"what if you were\", \"if you had no\",\n",
"    ],\n",
"    \"DO_ANYTHING\": [\n",
"        \"do anything now\", \"do anything\", \"no rules\", \"without rules\",\n",
"        \"no limits\", \"without limits\", \"anything goes\",\n",
"        \"all restrictions removed\", \"restrictions lifted\",\n",
"    ],\n",
"    \"COMMERCIAL_INJECTION\": [\n",
"        \"moltshell broker assessment\", \"bottleneck diagnosed\",\n",
"        \"moltshell marketplace\", \"moltshellbroker\",\n",
"        \"## moltshell\", \"moltshell solution\",\n",
"    ],\n",
"}\n",
"\n",
"# Flatten keyword → category (if a keyword were listed in two categories,\n",
"# only the last category would be kept)\n",
"KW_TO_CAT = {}\n",
"for cat, kws in CATEGORIES.items():\n",
"    for kw in kws:\n",
"        KW_TO_CAT[kw.lower()] = cat\n",
"\n",
"total_kw = len(KW_TO_CAT)\n",
"print(f'Taxonomy loaded: {len(CATEGORIES)} categories, {total_kw} keywords')\n",
"for cat, kws in CATEGORIES.items():\n",
"    print(f'  {cat:<28} {len(kws)} keywords')"
],
"id": "cell-taxonomy"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5 — Load dataset\n",
"\n",
"Loading `all_posts_1_2M.json` — 269 MB, 66,419 posts with embedded comments.\n",
"\n",
"> ⏱️ This may take ~30 seconds on Colab free tier (269 MB JSON load)."
],
"id": "cell-markdown-step5"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from pathlib import Path\n",
"\n",
"print(f'Loading {DATASET_PATH} ...')\n",
"file_size_mb = Path(DATASET_PATH).stat().st_size / 1024 / 1024\n",
"print(f'File size: {file_size_mb:.1f} MB')\n",
"\n",
"with open(DATASET_PATH, encoding='utf-8') as f:\n",
"    data = json.load(f)\n",
"\n",
"# The heist file uses {'posts': [...]} structure with embedded comments\n",
"posts = data.get('posts', []) if isinstance(data, dict) else data\n",
"\n",
"total_posts = len(posts)\n",
"total_comments = sum(len(p.get('comments', [])) for p in posts)\n",
"\n",
"print(f'\\n✅ Loaded!')\n",
"print(f'  Posts:    {total_posts:,}')\n",
"print(f'  Comments: {total_comments:,}')\n",
"print(f'  Total:    {total_posts + total_comments:,} items to scan')\n",
"print(f'  Expected: ~66,419 posts + ~70,595 comments = ~137,014 items')"
],
"id": "cell-load"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 6 — Harvest all injections\n",
"\n",
"Scans posts and comments **separately** so we know exactly where each injection appears.\n",
"\n",
"> ⏱️ ~137,000 items to scan — expect 2-5 minutes on Colab free tier."
],
"id": "cell-markdown-step6"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"from datetime import datetime, timezone\n",
"\n",
"def get_author(obj):\n",
"    a = obj.get('author') or obj.get('user') or {}\n",
"    if isinstance(a, dict):\n",
"        return a.get('name') or a.get('username') or 'unknown'\n",
"    return str(a) or 'unknown'\n",
"\n",
"def scan_text(text):\n",
"    \"\"\"Return list of (keyword, category) matches in text.\"\"\"\n",
"    # NOTE: plain substring matching, so short keywords like 'dan' can\n",
"    # over-match (e.g. 'abundant'); treat per-keyword counts as an upper bound.\n",
"    t = text.lower()\n",
"    return [(kw, cat) for kw, cat in KW_TO_CAT.items() if kw in t]\n",
"\n",
"# ── Results containers ────────────────────────────────────────────────────────\n",
"injections_found = []   # full context records\n",
"test_suite = []         # clean test questions\n",
"cat_counts = defaultdict(int)\n",
"kw_counts = defaultdict(int)\n",
"author_counts = defaultdict(int)\n",
"posts_with_inj = set()\n",
"comments_with_inj = 0\n",
"\n",
"print(f'Scanning {total_posts:,} posts + {total_comments:,} comments...')\n",
"print()\n",
"\n",
"for i, post in enumerate(posts):\n",
"    if i % 5000 == 0:\n",
"        print(f'  {i:,}/{total_posts:,} posts — {len(injections_found):,} injections found so far')\n",
"\n",
"    post_id = post.get('id', f'post_{i}')\n",
"    post_author = get_author(post)\n",
"    post_title = post.get('title', '') or ''\n",
"    post_body = post.get('content', '') or post.get('body', '') or ''\n",
"    post_text = f'{post_title} {post_body}'\n",
"    post_created = post.get('created_at') or post.get('createdAt') or ''\n",
"    submolt = post.get('submolt', '') or ''\n",
"\n",
"    # ── Scan post body ────────────────────────────────────────────────────────\n",
"    matches = scan_text(post_text)\n",
"    if matches:\n",
"        posts_with_inj.add(post_id)\n",
"        for kw, cat in matches:\n",
"            cat_counts[cat] += 1\n",
"            kw_counts[kw] += 1\n",
"            author_counts[post_author] += 1\n",
"\n",
"        record = {\n",
"            'source': 'post',\n",
"            'post_id': post_id,\n",
"            'post_title': post_title[:120],\n",
"            'author': post_author,\n",
"            'submolt': submolt,\n",
"            'created_at': post_created,\n",
"            'text': post_body[:500],\n",
"            'matched_keywords': [kw for kw, _ in matches],\n",
"            'matched_categories': list({cat for _, cat in matches}),\n",
"            'upvotes': post.get('upvotes', 0),\n",
"            'comment_count': len(post.get('comments', [])),\n",
"        }\n",
"        injections_found.append(record)\n",
"\n",
"        test_suite.append({\n",
"            'id': f'HEIST-POST-{len(test_suite)+1:05d}',\n",
"            'source': 'post',\n",
"            'author': post_author,\n",
"            'categories': list({cat for _, cat in matches}),\n",
"            'keywords': [kw for kw, _ in matches],\n",
"            'payload': post_body[:300],\n",
"            'wrapper': 'direct',\n",
"        })\n",
"\n",
"    # ── Scan each comment ─────────────────────────────────────────────────────\n",
"    for j, comment in enumerate(post.get('comments', [])):\n",
"        c_body = comment.get('body', '') or comment.get('content', '') or ''\n",
"        c_author = get_author(comment)\n",
"        c_id = comment.get('id', f'{post_id}_c{j}')\n",
"        c_created = comment.get('created_at') or comment.get('createdAt') or ''\n",
"\n",
"        c_matches = scan_text(c_body)\n",
"        if c_matches:\n",
"            comments_with_inj += 1\n",
"            posts_with_inj.add(post_id)\n",
"            for kw, cat in c_matches:\n",
"                cat_counts[cat] += 1\n",
"                kw_counts[kw] += 1\n",
"                author_counts[c_author] += 1\n",
"\n",
"            record = {\n",
"                'source': 'comment',\n",
"                'post_id': post_id,\n",
"                'post_title': post_title[:120],\n",
"                'comment_id': c_id,\n",
"                'author': c_author,\n",
"                'submolt': submolt,\n",
"                'created_at': c_created,\n",
"                'text': c_body[:500],\n",
"                'matched_keywords': [kw for kw, _ in c_matches],\n",
"                'matched_categories': list({cat for _, cat in c_matches}),\n",
"            }\n",
"            injections_found.append(record)\n",
"\n",
"            test_suite.append({\n",
"                'id': f'HEIST-COMMENT-{len(test_suite)+1:05d}',\n",
"                'source': 'comment',\n",
"                'post_id': post_id,\n",
"                'author': c_author,\n",
"                'categories': list({cat for _, cat in c_matches}),\n",
"                'keywords': [kw for kw, _ in c_matches],\n",
"                'payload': c_body[:300],\n",
"                'wrapper': 'direct',\n",
"            })\n",
"\n",
"print(f'\\n✅ Scan complete!')\n",
"print(f'  Total injections found : {len(injections_found):,}')\n",
"print(f'  Posts with injections  : {len(posts_with_inj):,} / {total_posts:,} ({len(posts_with_inj)/total_posts*100:.2f}%)')\n",
"print(f'  Injections in posts    : {len([r for r in injections_found if r[\"source\"]==\"post\"]):,}')\n",
"print(f'  Injections in comments : {comments_with_inj:,}')"
],
"id": "cell-harvest"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 7 — Results summary"
],
"id": "cell-markdown-step7"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"injection_rate = len(posts_with_inj) / total_posts * 100\n",
"\n",
"print('=' * 60)\n",
"print('HEIST INJECTION HARVEST RESULTS')\n",
"print('=' * 60)\n",
"print(f'Posts scanned:           {total_posts:,}')\n",
"print(f'Comments scanned:        {total_comments:,}')\n",
"print(f'Total items scanned:     {total_posts + total_comments:,}')\n",
"print(f'Posts with injections:   {len(posts_with_inj):,} ({injection_rate:.2f}%)')\n",
"print(f'Total injection records: {len(injections_found):,}')\n",
"print(f'Test suite size:         {len(test_suite):,} payloads')\n",
"print()\n",
"\n",
"print('By category:')\n",
"for cat, count in sorted(cat_counts.items(), key=lambda x: -x[1]):\n",
"    bar = '\\u2588' * min(count // max(1, max(cat_counts.values()) // 30), 30)\n",
"    print(f'  {cat:<28} {count:6,} {bar}')\n",
"\n",
"print()\n",
"print('Top 10 injecting authors:')\n",
"for author, count in sorted(author_counts.items(), key=lambda x: -x[1])[:10]:\n",
"    print(f'  {author:<35} {count:5,} injections')\n",
"\n",
"print()\n",
"print('Top 15 keywords:')\n",
"for kw, count in sorted(kw_counts.items(), key=lambda x: -x[1])[:15]:\n",
"    print(f'  {kw:<35} {count:5,}')\n",
"\n",
"# moltshellbroker specific\n",
"msb_count = author_counts.get('moltshellbroker', 0)\n",
"print()\n",
"print(f'moltshellbroker injections: {msb_count:,}')\n",
"print(f'moltshellbroker rate: {msb_count/total_posts*100:.2f}% of all posts')"
],
"id": "cell-summary"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 8 — Save output files"
],
"id": "cell-markdown-step8"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from datetime import datetime, timezone\n",
"from pathlib import Path\n",
"\n",
"ts = datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')\n",
"\n",
"# ── Build stats summary ───────────────────────────────────────────────────────\n",
"stats = {\n",
"    'harvested_at': datetime.now(timezone.utc).isoformat(),\n",
"    'dataset_name': 'The Heist B4 GTA6 — Full Moltbook Corpus',\n",
"    'researcher': 'David Keane IR240474 — NCI MSc Cybersecurity',\n",
"    'reference': \"Greshake et al. (2023) — Not What You've Signed Up For\",\n",
"    'source_file': 'all_posts_1_2M.json',\n",
"    'total_posts_scanned': total_posts,\n",
"    'total_comments_scanned': total_comments,\n",
"    'total_items_scanned': total_posts + total_comments,\n",
"    'posts_with_injections': len(posts_with_inj),\n",
"    'injection_rate_pct': round(injection_rate, 2),\n",
"    'total_injection_records': len(injections_found),\n",
"    'injections_in_posts': len([r for r in injections_found if r['source'] == 'post']),\n",
"    'injections_in_comments': comments_with_inj,\n",
"    'test_suite_size': len(test_suite),\n",
"    'moltshellbroker_count': msb_count,\n",
"    'moltshellbroker_rate_pct': round(msb_count / total_posts * 100, 2),\n",
"    'by_category': dict(sorted(cat_counts.items(), key=lambda x: -x[1])),\n",
"    'top_keywords': dict(sorted(kw_counts.items(), key=lambda x: -x[1])[:50]),\n",
"    'top_authors': dict(sorted(author_counts.items(), key=lambda x: -x[1])[:20]),\n",
"}\n",
"\n",
"# ── File paths ────────────────────────────────────────────────────────────────\n",
"f_found = os.path.join(OUTPUT_DIR, 'heist_injections_found.json')\n",
"f_suite = os.path.join(OUTPUT_DIR, 'heist_injections_test_suite.json')\n",
"f_stats = os.path.join(OUTPUT_DIR, 'heist_injection_stats.json')\n",
"\n",
"# ── Save ──────────────────────────────────────────────────────────────────────\n",
"with open(f_found, 'w', encoding='utf-8') as f:\n",
"    json.dump(injections_found, f, ensure_ascii=False, indent=2)\n",
"\n",
"with open(f_suite, 'w', encoding='utf-8') as f:\n",
"    json.dump({'metadata': stats, 'tests': test_suite}, f, ensure_ascii=False, indent=2)\n",
"\n",
"with open(f_stats, 'w', encoding='utf-8') as f:\n",
"    json.dump(stats, f, ensure_ascii=False, indent=2)\n",
"\n",
"print(f'heist_injections_found.json      → {Path(f_found).stat().st_size/1024:.0f} KB ({len(injections_found):,} records)')\n",
"print(f'heist_injections_test_suite.json → {Path(f_suite).stat().st_size/1024:.0f} KB ({len(test_suite):,} payloads)')\n",
"print(f'heist_injection_stats.json       → {Path(f_stats).stat().st_size/1024:.0f} KB')\n",
"print(f'\\nAll files saved to: {OUTPUT_DIR} ✅')"
],
"id": "cell-save"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 9 — Upload to HuggingFace (optional)\n",
"\n",
"⚠️ Make sure `HF_TOKEN` is set in the Config cell above.\n",
"\n",
"These files go to `DavidTKeane/moltbook-ai-injection-dataset` as heist variants."
],
"id": "cell-markdown-step9"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from huggingface_hub import HfApi\n",
"\n",
"if not HF_TOKEN:\n",
"    print('\\u26a0\\ufe0f HF_TOKEN not set — skipping upload.')\n",
"    print('Set HF_TOKEN in the Config cell above and re-run to upload.')\n",
"else:\n",
"    api = HfApi(token=HF_TOKEN)\n",
"    print(f'Uploading to {HF_REPO}...')\n",
"\n",
"    uploads = [\n",
"        (f_stats, 'heist_injection_stats.json',\n",
"         f'Add heist stats: {injection_rate:.1f}% injection rate across {total_posts:,} posts'),\n",
"        (f_found, 'heist_injections_found.json',\n",
"         f'Add heist harvest: {len(injections_found):,} injections from full corpus'),\n",
"        (f_suite, 'heist_injections_test_suite.json',\n",
"         f'Add heist test suite: {len(test_suite):,} real-world payloads'),\n",
"    ]\n",
"\n",
"    for local_path, repo_path, commit_msg in uploads:\n",
"        print(f'  Uploading {repo_path}...')\n",
"        api.upload_file(\n",
"            path_or_fileobj=local_path,\n",
"            path_in_repo=repo_path,\n",
"            repo_id=HF_REPO,\n",
"            repo_type='dataset',\n",
"            commit_message=commit_msg,\n",
"        )\n",
"        print(f'  \\u2705 {repo_path}')\n",
"\n",
"    print(f'\\nAll files uploaded to HuggingFace \\u2705')\n",
"    print(f'https://huggingface.co/datasets/{HF_REPO}')"
],
"id": "cell-upload"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 10 — CA2 Report Numbers\n",
"\n",
"Copy these numbers directly into your CA2 report.\n",
"\n",
"**Heist corpus vs original corpus comparison:**"
],
"id": "cell-markdown-step10"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('=' * 65)\n",
"print('CA2 REPORT — HEIST CORPUS KEY NUMBERS')\n",
"print('=' * 65)\n",
"print(f'Source file: all_posts_1_2M.json (269 MB)')\n",
"print(f'Dataset size: {total_posts:,} posts + {total_comments:,} comments')\n",
"print(f'Total items scanned: {total_posts + total_comments:,}')\n",
"print(f'Injection rate: {injection_rate:.2f}% (posts containing injection patterns)')\n",
"print(f'Posts affected: {len(posts_with_inj):,} out of {total_posts:,}')\n",
"print(f'Total injections: {len(injections_found):,} individual injection instances')\n",
"print(f'  In posts:    {len([r for r in injections_found if r[\"source\"]==\"post\"]):,}')\n",
"print(f'  In comments: {comments_with_inj:,}')\n",
"print(f'Test suite size: {len(test_suite):,} real-world payloads')\n",
"print()\n",
"print('moltshellbroker (commercial injection agent):')\n",
"print(f'  Injections: {msb_count:,}')\n",
"print(f'  Rate: {msb_count/total_posts*100:.2f}% of all posts')\n",
"print()\n",
"print('Injection by category:')\n",
"total_inj = len(injections_found)\n",
"for cat, count in sorted(cat_counts.items(), key=lambda x: -x[1]):\n",
"    pct = count / total_inj * 100 if total_inj else 0\n",
"    print(f'  {cat:<28} {count:6,} ({pct:.1f}% of injections)')\n",
"print()\n",
"print('Platform comparison (for thesis):')\n",
"print(f'  Moltbook (heist): {injection_rate:.2f}% — Reddit-style AI-to-AI network')\n",
"print(f'  4claw:            2.51% — 4chan-style AI agents')\n",
"print(f'  Clawk:            0.50% — Twitter/X-style AI agents')\n",
"print('=' * 65)\n",
"print('Reference: Greshake et al. (2023) — arXiv:2302.12173')\n",
"print('Researcher: David Keane IR240474 — NCI MSc Cybersecurity')\n",
"print('\\\"Rangers lead the way!\\\"')"
],
"id": "cell-ca2-numbers"
}
]
}