CTIConnect

A Benchmark for Retrieval-Augmented LLMs over Heterogeneous Cyber Threat Intelligence

Yutong Cheng1 · Yang Liu1 · Changze Li1 · Dawn Song2 · Peng Gao1
1Virginia Tech    2UC Berkeley
🏆 Leaderboard

Performance across 10 LLMs

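Per-task F1 is reported under three retrieval configurations: CB (Closed-Book), VR (Vanilla RAG), and DS (the domain-specific strategy: Extract-then-Retrieve (EtR) for Entity Linking (EL), Decompose-then-Retrieve (DtR) for Entity Attribution (EA), and CSKG-guided RAG for Multi-Doc Synthesis (MDS)). Multi-Doc Synthesis omits CB because the task inherently requires multi-document retrieval.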
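
This page does not spell out how F1 is computed for each task. As a rough illustration only, the sketch below assumes answers are sets of labels (for example, taxonomy IDs) and scores a prediction against the gold set; the function name `set_f1` and the set-valued assumption are ours, not the benchmark's.

```python
def set_f1(predicted, gold):
    """F1 between a predicted label set and a gold label set (assumed scoring)."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0  # nothing to predict and nothing predicted
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # One correct label plus one spurious label.
    print(set_f1(["CWE-79", "CWE-89"], ["CWE-79"]))
```

Under this assumption, a spurious extra label lowers precision while recall stays at 1, so the score drops below 1 even though the gold label was found.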

Benchmark Overview

Bridging the cross-source semantic gap in CTI

CTI knowledge spans structured taxonomies (CVE, CWE, CAPEC, MITRE ATT&CK) and unstructured vendor reports. Queries expressed in one vocabulary systematically fail to match relevant evidence in another — causing vanilla embedding-based retrieval to break down. CTIConnect is the first retrieval-augmented benchmark that systematically evaluates this gap across the full CTI task landscape.

We integrate 5 heterogeneous CTI sources into 691 expert-curated QA pairs across 9 tasks in 3 categories. Each category is paired with a domain-specific retrieval strategy and evaluated against vanilla RAG baselines on 10 state-of-the-art LLMs.

Unstructured vendor threat reports · 321 reports from 35+ sources

…and 30+ more vendor and security-news sources (BleepingComputer, Dark Reading, The Hacker News, Symantec, Kaspersky, ESET, Check Point, Microsoft, NCC Group, etc.)

The Nine Tasks

Nine cross-source operations

The tasks are organized into three categories according to the source types they bridge. Together, they cover all cross-source directions in CTI analysis.

Retrieval Strategies

Domain-specific strategies for each semantic gap

Each task category exhibits a distinct manifestation of the cross-source semantic gap. We design a domain-specific retrieval strategy for each, evaluated against Closed-Book and Vanilla RAG baselines.

EL

Extract-then-Retrieve (EtR)

The LLM parses the input query to extract security-relevant attributes (vulnerability descriptions, weakness indicators, technical identifiers) and canonicalizes them into structured search keys aligned with the target KB. Multi-modal retrieval combines semantic and exact matching.

query → extract & canonicalize → semantic + exact retrieve
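
The flow above can be sketched in a few lines. This is a toy illustration, not the benchmark's implementation: a regular expression stands in for the LLM extraction step, token overlap (Jaccard) stands in for embedding similarity, and the miniature `KB`, `extract_keys`, and `etr_retrieve` are hypothetical names introduced here.

```python
import re

# Hypothetical miniature knowledge base: entry ID -> description text.
KB = {
    "CWE-79": "improper neutralization of input during web page generation cross-site scripting",
    "CWE-89": "improper neutralization of special elements used in an sql command sql injection",
    "CWE-787": "out-of-bounds write memory corruption",
}

# Identifier shapes for CVE, CWE, CAPEC, and ATT&CK technique IDs.
ID_PATTERN = re.compile(
    r"\b(CVE-\d{4}-\d{4,}|CWE-\d+|CAPEC-\d+|T\d{4}(?:\.\d{3})?)\b", re.IGNORECASE
)


def extract_keys(query):
    """Stand-in for LLM extraction: canonical IDs plus lowercase keywords."""
    ids = [m.upper() for m in ID_PATTERN.findall(query)]
    words = set(re.findall(r"[a-z]+", query.lower()))
    return ids, words


def jaccard(a, b):
    """Token-overlap similarity, used here in place of embedding similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def etr_retrieve(query, k=2):
    """Score each KB entry by exact ID match plus keyword similarity."""
    ids, words = extract_keys(query)
    scored = []
    for entry_id, text in KB.items():
        exact = 1.0 if entry_id in ids else 0.0
        semantic = jaccard(words, set(text.split()))
        scored.append((exact + semantic, entry_id))
    scored.sort(reverse=True)
    return [entry_id for _, entry_id in scored[:k]]


if __name__ == "__main__":
    print(etr_retrieve("The flaw cwe-89 allows SQL injection via the login form", k=1))
```

Adding the exact-match term to the similarity score lets a canonicalized identifier dominate the ranking when one is present, while queries without identifiers fall back to the similarity term alone.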

EA

Decompose-then-Retrieve (DtR)

Decomposes the input passage into M atomic behaviors, canonicalizes each into taxonomy-aligned vocabulary, and performs independent retrieval per behavior before aggregation. Addresses the vocabulary mismatch between narrative prose and formal taxonomy entries.

passage → decompose → per-behavior retrieve → aggregate
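
A minimal sketch of this flow follows, under loud assumptions: sentence splitting stands in for LLM decomposition into atomic behaviors, a small synonym map stands in for canonicalization, and token overlap stands in for the retriever. `TAXONOMY`, `CANON`, and all function names are hypothetical; only the three ATT&CK technique IDs are real.

```python
import re
from collections import Counter

# Hypothetical mini taxonomy: ATT&CK technique ID -> descriptive tokens.
TAXONOMY = {
    "T1566": {"phishing", "email", "attachment"},
    "T1059": {"command", "scripting", "interpreter", "powershell"},
    "T1041": {"exfiltration", "over", "c2", "channel"},
}

# Hypothetical map from narrative wording to taxonomy vocabulary.
CANON = {
    "lure": "phishing",
    "mail": "email",
    "script": "scripting",
    "stole": "exfiltration",
    "uploaded": "exfiltration",
}


def decompose(passage):
    """Stand-in for LLM decomposition: one behavior per sentence or clause."""
    parts = re.split(r"[.;]\s*", passage)
    return [p.strip() for p in parts if p.strip()]


def canonicalize(behavior):
    """Rewrite narrative tokens into taxonomy-aligned tokens."""
    tokens = re.findall(r"[a-z0-9]+", behavior.lower())
    return {CANON.get(t, t) for t in tokens}


def retrieve(tokens, k=1):
    """Return up to k taxonomy entries with non-zero token overlap."""
    scored = sorted(
        ((len(tokens & desc), tid) for tid, desc in TAXONOMY.items()), reverse=True
    )
    return [tid for score, tid in scored[:k] if score > 0]


def dtr(passage, k=1):
    """Retrieve independently per behavior, then aggregate by vote count."""
    votes = Counter()
    for behavior in decompose(passage):
        for tid in retrieve(canonicalize(behavior), k):
            votes[tid] += 1
    return [tid for tid, _ in votes.most_common()]


if __name__ == "__main__":
    text = (
        "The actor sent a lure mail with an attachment. "
        "A PowerShell script ran on the host. "
        "Files were uploaded to the C2 server."
    )
    print(dtr(text))
```

Retrieving per behavior keeps one dominant behavior from crowding the others out of a single top-k list, which is the failure mode a whole-passage query is prone to.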

MDS

CSKG-Guided RAG

Builds a Cybersecurity Knowledge Graph offline by extracting named entities from each report. At query time, entities are extracted from the input and corpus reports are ranked by entity overlap rather than embedding similarity — bypassing alias-driven retrieval failures.

extract entities → overlap match → retrieve top-k reports
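
The ranking step can be illustrated with a toy index. This is a sketch under stated assumptions, not the benchmark's pipeline: substring matching against a known vocabulary stands in for entity extraction, the entity names and report IDs are made up, and `REPORT_ENTITIES`, `extract_entities`, and `rank_by_overlap` are hypothetical names.

```python
# Hypothetical offline index: report ID -> entities extracted from that report.
# All entity names are fictitious placeholders.
REPORT_ENTITIES = {
    "report-a": {"groupalpha", "loaderx", "example-tool"},
    "report-b": {"groupalpha", "stealery"},
    "report-c": {"groupbeta", "loaderx"},
    "report-d": {"groupgamma"},
}


def extract_entities(text, vocabulary):
    """Stand-in for entity extraction: find known entity strings in the text."""
    lowered = text.lower()
    return {entity for entity in vocabulary if entity in lowered}


def rank_by_overlap(query, index, k=2):
    """Rank reports by the number of entities shared with the query."""
    vocabulary = set().union(*index.values())
    query_entities = extract_entities(query, vocabulary)
    scored = [(len(query_entities & ents), rid) for rid, ents in index.items()]
    scored = [item for item in scored if item[0] > 0]
    # Highest overlap first; ties broken by report ID for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [rid for _, rid in scored[:k]]


if __name__ == "__main__":
    print(rank_by_overlap("Which reports discuss GroupAlpha deploying LoaderX?", REPORT_ENTITIES))
```

Because the score depends only on shared entities, a report that merely sounds similar to the query but names different actors or tools receives no credit.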

Key Findings

What 10 LLMs × 9 tasks × 3 configs revealed

Domain-specific retrieval beats vanilla RAG

Tailored strategies yield substantial gains over generic embedding retrieval — +35.2% for Entity Linking, +16.0% for Entity Attribution, and +11.3% for Multi-Doc Synthesis — with the magnitude depending on the semantic gap of each category.

The semantic gap is task-asymmetric

Query-to-gold vs. query-to-top-1 cosine gap by category: 0.06 for EL, 0.31 for EA, 0.43 for MDS. EL is near-saturated, EA suffers vocabulary mismatch, and MDS is dominated by alias-driven near-miss distractors.
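
One plausible reading of this gap is the cosine similarity between the query and its top-1 retrieved document minus the similarity between the query and the gold document. The sketch below computes that quantity for toy vectors; the definition and the names `cosine` and `cosine_gap` are our assumptions, and the page may define the measure differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cosine_gap(query, gold, corpus):
    """Similarity to the top-1 corpus document minus similarity to the gold one."""
    top1 = max(cosine(query, doc) for doc in corpus)
    return top1 - cosine(query, gold)


if __name__ == "__main__":
    query = [1.0, 0.0]
    gold = [1.0, 1.0]
    corpus = [[1.0, 0.1], [0.0, 1.0], gold]
    # A positive gap means some distractor outranks the gold document.
    print(round(cosine_gap(query, gold, corpus), 3))
```

Under this reading, a gap of zero means the gold document is itself the top-1 hit, and a larger gap means a distractor sits further ahead of it in embedding space.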

The bottleneck shifts by category

For EL and MDS the bottleneck is retrieval infrastructure; for EA it shifts to model reasoning. Intra-family scaling benefits concentrate in EA — LLaMA-3 8B→405B gains +29.5% on attribution alone.

No single model dominates

GPT-5 leads overall at 81.4%, but the next three (Qwen-3-235B, GPT-4o, Claude-Sonnet-4) cluster within 1%. Qwen-3-235B even tops EL with perfect F1 on three tasks — CTIConnect discriminates along multiple capability axes.

Dataset

Construction pipeline & data composition

Benchmark construction pipeline

| Data source | Tasks | QA pairs | Ground truth |
|---|---|---|---|
| Structured Mappings | RCM, WIM, ATD, ESD | 400 | Official KB links |
| Report Clusters | TAP, MLA, CSC | 141 | Manual clustering |
| B2F Alignments | ATA, VCA | 150 | Dual annotation |
| Total | 9 tasks | 691 | |

All 691 QA pairs are produced through a three-stage pipeline ensuring authoritative ground truth and quality control.

  1. Seed correlation annotation. Seed correlations are derived from official MITRE/NVD cross-source mappings for Entity Linking, and from dual expert annotation with senior adjudication for Entity Attribution and Multi-Doc Synthesis.
  2. Template-constrained QA synthesis. Each correlation is transformed into task-specific QA via prompt templates that specify the instruction, input entity, expected output, and required format — grounding generation in verified correlations to reduce hallucination.
  3. LLM–human collaborative curation. A GPT-4 judge filters low-confidence samples; two domain practitioners independently verify each pair; a senior annotator with 3+ years of CTI experience adjudicates final inclusion.
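
Stage 2 can be pictured as filling a fixed template from a verified correlation. The sketch below is illustrative only: the `Correlation` and `QATemplate` classes, their fields, and the placeholder IDs are hypothetical, and the actual prompt templates are not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correlation:
    """A verified cross-source link (placeholder fields)."""
    source_id: str
    target_id: str


@dataclass(frozen=True)
class QATemplate:
    """Fixes the instruction and the required answer format for one task."""
    instruction: str      # must contain an {entity} slot
    output_format: str

    def render(self, corr):
        # The answer is copied from the verified correlation, never generated.
        question = f"{self.instruction.format(entity=corr.source_id)} {self.output_format}"
        return {
            "question": question,
            "input_entity": corr.source_id,
            "answer": corr.target_id,
        }


if __name__ == "__main__":
    template = QATemplate(
        instruction="Which target entry is linked to {entity}?",
        output_format="Answer with a single identifier.",
    )
    print(template.render(Correlation(source_id="SRC-001", target_id="TGT-042")))
```

Taking the answer directly from the correlation is what grounds each pair in verified data; the template constrains only the wording and the output format.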

Cite

BibTeX

If you find CTIConnect useful in your research, please cite our paper.

@inproceedings{cheng2026cticonnect,
  title     = {CTIConnect: A Benchmark for Retrieval-Augmented LLMs over Heterogeneous Cyber Threat Intelligence},
  author    = {Cheng, Yutong and Liu, Yang and Li, Changze and Song, Dawn and Gao, Peng},
  booktitle = {Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '26)},
  year      = {2026},
  publisher = {ACM},
  doi       = {10.1145/3770855.3817527}
}