A Benchmark for Retrieval-Augmented LLMs over Heterogeneous Cyber Threat Intelligence
Per-task F1 across three retrieval configurations: CB (Closed-Book), VR (Vanilla RAG), and DS (Domain-Specific strategy: EtR for EL, DtR for EA, CSKG-guided for MDS). Multi-Doc Synthesis omits CB as the task inherently requires multi-document retrieval.
CTI knowledge spans structured taxonomies (CVE, CWE, CAPEC, MITRE ATT&CK) and unstructured vendor reports. Queries expressed in one vocabulary systematically fail to match relevant evidence in another — causing vanilla embedding-based retrieval to break down. CTIConnect is the first retrieval-augmented benchmark that systematically evaluates this gap across the full CTI task landscape.
We integrate 5 heterogeneous CTI sources into 691 expert-curated QA pairs across 9 tasks in 3 categories. Each category is paired with a domain-specific retrieval strategy and evaluated against vanilla RAG baselines on 10 state-of-the-art LLMs.
Structured knowledge bases
The nine tasks are organized into three categories based on the source types they bridge. Together, they cover all cross-source directions in CTI analysis.
Each task category exhibits a distinct manifestation of the cross-source semantic gap. We design a domain-specific retrieval strategy for each, evaluated against Closed-Book and Vanilla RAG baselines.
EtR, the strategy for Entity Linking: the LLM parses the input query to extract security-relevant attributes (vulnerability descriptions, weakness indicators, technical identifiers) and canonicalizes them into structured search keys aligned with the target KB. Multi-modal retrieval then combines semantic and exact matching.
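The extract, canonicalize, and hybrid-match flow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: a regex stands in for the LLM extraction step, token overlap stands in for embedding similarity, and the toy `KB` and the names `extract_keys` and `hybrid_retrieve` are hypothetical rather than part of CTIConnect.

```python
import re

# Hypothetical toy knowledge base (illustrative text, not benchmark data).
KB = {
    "CWE-79": "improper neutralization of input during web page generation cross-site scripting",
    "CWE-89": "improper neutralization of special elements used in an sql command sql injection",
    "CWE-787": "out-of-bounds write memory corruption",
}

# Identifier shapes for CVE, CWE, CAPEC, and ATT&CK technique IDs.
ID_PATTERN = re.compile(
    r"\b(?:CVE-\d{4}-\d{4,}|CWE-\d+|CAPEC-\d+|T\d{4}(?:\.\d{3})?)\b",
    re.IGNORECASE,
)


def extract_keys(query):
    """Stand-in for the LLM step: return (canonical identifiers, lowercase terms)."""
    ids = [match.upper() for match in ID_PATTERN.findall(query)]
    terms = set(re.findall(r"[a-z]+", query.lower()))
    return ids, terms


def lexical_similarity(terms, text):
    """Jaccard overlap, used here only as a cheap proxy for semantic similarity."""
    doc_terms = set(re.findall(r"[a-z]+", text.lower()))
    if not terms or not doc_terms:
        return 0.0
    return len(terms & doc_terms) / len(terms | doc_terms)


def hybrid_retrieve(query, kb, k=2):
    """Rank KB entries by exact identifier match plus the similarity proxy."""
    ids, terms = extract_keys(query)
    scored = []
    for key, text in kb.items():
        exact = 1.0 if key in ids else 0.0
        scored.append((exact + lexical_similarity(terms, text), key))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [key for _, key in scored[:k]]


by_text = hybrid_retrieve("The flaw allows SQL injection via the login form", KB, k=1)
by_id = hybrid_retrieve("See cwe-787 for details", KB, k=1)
```

In this sketch an exact identifier hit always outranks a purely lexical match, because the exact-match term contributes a full point while the overlap score stays below one.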
DtR, the strategy for Entity Attribution, decomposes the input passage into M atomic behaviors, canonicalizes each into taxonomy-aligned vocabulary, and performs independent retrieval per behavior before aggregation. This addresses the vocabulary mismatch between narrative prose and formal taxonomy entries.
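A minimal sketch of the decompose, canonicalize, retrieve, and aggregate steps follows. It rests on several assumptions: sentence splitting stands in for LLM decomposition, a hand-written phrase map (`CANON`) stands in for LLM canonicalization, keyword overlap stands in for the retriever, and the three-entry `TAXONOMY` is a hypothetical excerpt with paraphrased descriptions. The function names are illustrative.

```python
import re

# Hypothetical mini-taxonomy: technique ID -> keyword description (illustrative).
TAXONOMY = {
    "T1566": "phishing spearphishing attachment link",
    "T1059": "command and scripting interpreter powershell",
    "T1041": "exfiltration over c2 channel",
}

# Hypothetical map from narrative phrasing to taxonomy-aligned vocabulary.
CANON = {
    "malicious email": "phishing",
    "ran a script": "scripting interpreter",
    "uploaded stolen files": "exfiltration",
}


def decompose(passage):
    """Stand-in for LLM decomposition: split the passage into atomic behaviors."""
    parts = re.split(r"[.;]\s*|\s+then\s+", passage)
    return [part.strip().lower() for part in parts if part.strip()]


def canonicalize(behavior):
    """Rewrite narrative phrases into taxonomy vocabulary."""
    for phrase, canonical in CANON.items():
        behavior = behavior.replace(phrase, canonical)
    return behavior


def retrieve_one(behavior, k=1):
    """Retrieve the top-k taxonomy entries for a single behavior by keyword overlap."""
    terms = set(re.findall(r"[a-z0-9]+", behavior))
    scored = [(len(terms & set(desc.split())), tid) for tid, desc in TAXONOMY.items()]
    scored = [(score, tid) for score, tid in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [tid for _, tid in scored[:k]]


def decompose_then_retrieve(passage):
    """Run retrieval independently per behavior, then aggregate without duplicates."""
    results = []
    for behavior in decompose(passage):
        for tid in retrieve_one(canonicalize(behavior)):
            if tid not in results:
                results.append(tid)
    return results


passage = "The actor sent a malicious email to staff. The payload ran a script; it then uploaded stolen files."
techniques = decompose_then_retrieve(passage)
```

Retrieving per behavior keeps one dominant behavior from crowding out the others, which is the point of decomposing before retrieval rather than embedding the whole passage once.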
The CSKG-guided strategy for Multi-Doc Synthesis builds a Cybersecurity Knowledge Graph (CSKG) offline by extracting named entities from each report. At query time, entities are extracted from the input, and corpus reports are ranked by entity overlap rather than embedding similarity, which bypasses alias-driven retrieval failures.
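Ranking by entity overlap can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: a small hand-written alias table (`ALIASES`) stands in for the entity extractor, the "graph" is reduced to a report-to-entity index, and the report texts and function names are hypothetical.

```python
# Hypothetical alias table: surface form -> canonical entity (illustrative).
ALIASES = {
    "cozy bear": "APT29",
    "apt29": "APT29",
    "fancy bear": "APT28",
    "apt28": "APT28",
    "cobalt strike": "Cobalt Strike",
}


def extract_entities(text):
    """Stand-in for entity extraction: match known aliases, return canonical names."""
    lowered = text.lower()
    return {canonical for alias, canonical in ALIASES.items() if alias in lowered}


def build_index(reports):
    """Offline step: map each report ID to its set of canonical entities."""
    return {report_id: extract_entities(text) for report_id, text in reports.items()}


def rank_by_overlap(query, index):
    """Query-time step: rank reports by the number of entities shared with the query."""
    query_entities = extract_entities(query)
    scored = [(len(query_entities & entities), report_id)
              for report_id, entities in index.items()]
    scored = [(score, report_id) for score, report_id in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [report_id for _, report_id in scored]


# Toy corpus written for this example only.
reports = {
    "r1": "Cozy Bear deployed Cobalt Strike beacons.",
    "r2": "Fancy Bear ran a phishing campaign.",
    "r3": "APT29 targeted diplomatic entities.",
}
index = build_index(reports)
ranking = rank_by_overlap("Summarize APT29 activity involving Cobalt Strike", index)
```

Because aliases are resolved to a canonical name before comparison, the report that mentions only "Cozy Bear" still matches a query about APT29, which is the failure mode that embedding similarity alone can miss.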
Tailored strategies yield substantial gains over generic embedding retrieval — +35.2% for Entity Linking, +16.0% for Entity Attribution, and +11.3% for Multi-Doc Synthesis — with the magnitude depending on the semantic gap of each category.
The gap between query-to-top-1 and query-to-gold cosine similarity varies by category: 0.06 for EL, 0.31 for EA, and 0.43 for MDS. EL is near-saturated, EA suffers from vocabulary mismatch, and MDS is dominated by alias-driven near-miss distractors.
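One plausible way to compute such a gap for a single query is sketched below. It assumes the gap is defined as the cosine similarity of the top-1 retrieved document minus that of the gold document; the source does not spell out the formula, so treat both the definition and the name `top1_gold_gap` as assumptions. The two-dimensional vectors are toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top1_gold_gap(query_vec, gold_vec, corpus_vecs):
    """Assumed definition: similarity to the top-1 document minus similarity to gold."""
    top1 = max(cosine(query_vec, doc) for doc in corpus_vecs)
    return top1 - cosine(query_vec, gold_vec)


# Toy vectors: the distractor is closer to the query than the gold document.
query = [1.0, 0.0]
gold = [0.6, 0.8]
distractor = [1.0, 0.0]
gap = top1_gold_gap(query, gold, [gold, distractor])  # approximately 0.4
```

Under this definition a gap of zero means the gold document is itself the top-1 result, and a larger gap means a distractor outranks the gold document by a wider margin.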
For EL and MDS the bottleneck is retrieval infrastructure; for EA it shifts to model reasoning. Intra-family scaling benefits concentrate in EA — LLaMA-3 8B→405B gains +29.5% on attribution alone.
GPT-5 leads overall at 81.4%, but the next three (Qwen-3-235B, GPT-4o, Claude-Sonnet-4) cluster within 1%. Qwen-3-235B even tops EL with perfect F1 on three tasks — CTIConnect discriminates along multiple capability axes.
| Data source | Tasks | QA | Ground truth |
|---|---|---|---|
| Structured Mappings | RCM, WIM, ATD, ESD | 400 | Official KB links |
| Report Clusters | TAP, MLA, CSC | 141 | Manual clustering |
| B2F Alignments | ATA, VCA | 150 | Dual annotation |
| Total | 9 tasks | 691 | — |
All 691 QA pairs are produced through a three-stage pipeline ensuring authoritative ground truth and quality control.
If you find CTIConnect useful in your research, please cite our paper.
@inproceedings{cheng2026cticonnect,
title = {CTIConnect: A Benchmark for Retrieval-Augmented LLMs over Heterogeneous Cyber Threat Intelligence},
author = {Cheng, Yutong and Liu, Yang and Li, Changze and Song, Dawn and Gao, Peng},
booktitle = {Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '26)},
year = {2026},
publisher = {ACM},
doi = {10.1145/3770855.3817527}
}