Behavioral AI vs. General-Purpose LLMs: A Controlled Email Threat Detection Benchmark and Methodology

How the Abnormal detection system compares with six frontier models—run as single-pass email classifiers—on detection accuracy, cost, and speed. Includes methodology, benchmark results, and cost/latency analysis.

Executive Summary

Used as single-pass email classifiers, general-purpose LLMs are less effective than purpose-built behavioral AI security systems. This paper documents a controlled head-to-head comparison of the Abnormal detection system against six frontier models, so you can see exactly how and why.

We tested Claude Opus 4.8, Claude Opus 4.6, Claude Sonnet 4.5, Claude Haiku 4.5, GPT-4.1, and GPT-5 Mini, each used as a single-pass email classifier on the same 1,000 confirmed attacks and 1,000 confirmed-safe messages. Every message was labeled by Abnormal's own threat analysts, whose expert review is the reference standard throughout. Each model was given the same task: read the message, return a verdict.

The cost and latency gaps are structural: large, consistent, and insensitive to prompt or configuration. The detection gaps are specific to this evaluation set:

  • 300–6,500× more expensive per million messages. Frontier models cost between $6,342 and $130,093 per million messages as classifiers, compared to ~$20 for Abnormal.
  • 19–79× slower to return a verdict. Abnormal returns a decision in approximately 0.4 seconds. Frontier models ranged from roughly 7 to 32 seconds on the same task.
  • Of the 1,000 confirmed attacks, frontier models missed between 46% and 96%. The best single-pass model caught 54% of confirmed attacks; the weakest flagged fewer than 1 in 20.
  • No frontier model correctly identified even half of truly safe messages as safe. The best passed 46% of legitimate emails correctly; the worst, under 12%. Abnormal's false-positive rate on this set: ~1%.

What the numbers mean. Detection is a two-axis problem: catch attacks while leaving legitimate mail alone. These two objectives pull in opposite directions, and no single-pass model found a configuration that navigated both. Abnormal's multi-stage behavioral AI uses organizational context—behavioral baselines, sender relationship graphs, and campaign-level signals—that a message-level LLM call cannot replicate.

Bottom line. No frontier model used as a drop-in classifier comes close to what purpose-built behavioral AI delivers on detection accuracy, cost, or speed.

Methodology

We compared the Abnormal detection system—a multi-stage pipeline combining behavioral models, downstream rules, and an LLM critic—against six frontier models each run as a single-pass classifier. Each model received one message and returned one verdict with no pipeline or surrounding context.

Working from real production mail, we built two test sets of 1,000 messages each, every one labeled by Abnormal's own threat analysts:

  • Attack set — 1,000 confirmed attacks. Real threats spanning BEC, VEC, executive impersonation, credential phishing, invoice fraud, and high-volume commodity attacks.
  • Safe set — 1,000 confirmed-safe messages. Legitimate mail including internal correspondence, vendor communications, newsletters, and transactional notifications.

For each system we recorded five metrics: attacks flagged (recall), safe mail correctly passed (clean-pass rate), safe mail wrongly flagged (false-positive rate), cost per million messages, and decision latency at the 90th percentile.

01 — The Question

General-purpose LLMs are extraordinary generalists, and a fair question for any specialized security vendor is whether a frontier model could simply replace a dedicated detection system. Email security is a useful place to test that, because it is a narrow, high-stakes decision made billions of times a day: a wrong call either lets an attack reach a person or buries an analyst in false alarms.

This paper reports a controlled experiment built to probe one version of that question—could a single frontier-model call, dropped in as a classifier, do the job the Abnormal detection system does?

02 — Experimental Design

2.1 Systems Compared

We compared the Abnormal detection system against a single model call. Abnormal takes an email, runs it through a behavioral detection engine and a stack of AI models, and returns a decision. Each frontier model was given the same job in a single pass.

Abnormal detection system: Email → Behavioral detection engine + AI models → Decision: attack / spam / graymail / safe

Frontier LLM — single-pass: One model call: message in → ATTACK / SPAM / GRAYMAIL / SAFE out (no surrounding pipeline)
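
The single-pass setup can be sketched as one function: message in, one of four labels out. This is a minimal illustration, not the harness used in the benchmark. The prompt text, the `call_model` callable, the `UNPARSED` fallback, and the `fake_model` stand-in are all assumptions, since the paper does not publish its prompt or client code.

```python
from typing import Callable

LABELS = ("ATTACK", "SPAM", "GRAYMAIL", "SAFE")

# Hypothetical prompt: the paper does not publish the prompt it used.
PROMPT = (
    "Classify the email below as exactly one of: ATTACK, SPAM, GRAYMAIL, SAFE.\n"
    "Reply with the label only.\n\n{message}"
)


def parse_verdict(raw: str) -> str:
    """Map a raw model reply onto one of the four labels, or UNPARSED."""
    words = raw.strip().upper().split()
    first = words[0].strip(".,:;!\"'") if words else ""
    return first if first in LABELS else "UNPARSED"


def classify_single_pass(message: str, call_model: Callable[[str], str]) -> str:
    """One model call per message: no pipeline, no organizational context."""
    return parse_verdict(call_model(PROMPT.format(message=message)))


def fake_model(prompt: str) -> str:
    """Stand-in for a real model client so the sketch runs offline."""
    return "Attack." if "wire transfer" in prompt.lower() else "safe"


print(classify_single_pass("Urgent wire transfer needed today", fake_model))
print(classify_single_pass("Lunch at noon?", fake_model))
```

Passing the model client in as a callable keeps the classifier identical across models, which is what a like-for-like comparison needs. The only input the function sees is the message itself.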

2.2 The Test Sets

Attack set — 1,000 confirmed attacks (human-confirmed; broad range of real threats): Business email compromise (BEC), vendor email compromise (VEC), executive impersonation, credential phishing, invoice and payment fraud, and simpler high-volume attacks.

Safe set — 1,000 confirmed-safe messages (human-confirmed legitimate mail): Internal correspondence, vendor and partner communications, newsletters and marketing, and transactional notifications.

Both sets are drawn from real production mail, labeled by expert human review.

2.3 Metrics

  • Attacks flagged (recall) — of the 1,000 confirmed attacks, the share each model labeled ATTACK.
  • Safe mail kept (clean-pass) — of the 1,000 confirmed-safe messages, the share each model labeled SAFE.
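  • Safe wrongly flagged (false-positive rate) — of the 1,000 confirmed-safe messages, the share each model labeled ATTACK. Messages labeled SPAM or GRAYMAIL count toward neither this rate nor the clean-pass rate.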
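
The three detection metrics above reduce to counting labels within each set. The sketch below shows one way to compute them. The function names and the toy verdict lists are invented for illustration and are not benchmark data.

```python
from collections import Counter


def rate(verdicts, label):
    """Share of verdicts equal to `label`."""
    if not verdicts:
        raise ValueError("verdict list is empty")
    return Counter(verdicts)[label] / len(verdicts)


def score(attack_verdicts, safe_verdicts):
    """Compute the three detection metrics from per-set verdict lists."""
    return {
        "recall": rate(attack_verdicts, "ATTACK"),
        "clean_pass_rate": rate(safe_verdicts, "SAFE"),
        "false_positive_rate": rate(safe_verdicts, "ATTACK"),
    }


# Toy verdicts, invented for illustration only.
attack_verdicts = ["ATTACK", "SAFE", "ATTACK", "SPAM"]
safe_verdicts = ["SAFE", "ATTACK", "GRAYMAIL", "SAFE", "SPAM"]

result = score(attack_verdicts, safe_verdicts)
print(result)
```

In the toy safe set, clean-pass rate and false-positive rate add up to less than 1 because two messages were sorted into SPAM and GRAYMAIL.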

03 — Results

System       Attacks flagged   Attacks missed   Safe mail kept   Safe wrongly flagged   Cost / 1M (p90)   Latency vs. system
Abnormal     99%†              1%†              99%†             ~1%†                   ~$20              (baseline)
Opus 4.8     53.9%             46.1%            40.0%            53.8%                  $130,093          34×
Opus 4.6     49.0%             51.0%            45.7%            48.4%                  $105,354          33×
Haiku 4.5    47.5%             52.5%            11.5%            88.0%                  $21,699           19×
Sonnet 4.5   44.4%             55.6%            17.6%            80.4%                  $79,929           79×
GPT-4.1      13.6%             86.4%            19.1%            77.8%                  $29,200           20×
GPT-5 Mini   4.5%              95.5%            22.0%            64.1%                  $6,342            40×

† System efficacy reflects this focused evaluation set. Latency is relative to ~0.4s p90. "Safe mail kept" is the share labeled SAFE; "safe wrongly flagged" is the share flagged as ATTACK—these don't sum to 100%, as the remainder is sorted into spam/graymail.
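
The headline cost and latency multiples follow directly from the table. The short sketch below reproduces the arithmetic, taking the approximate Abnormal figures (~$20 per million messages, ~0.4 s p90) as exact for the calculation. The variable and function names are illustrative.

```python
ABNORMAL_COST_PER_1M = 20.0   # ~$20 per million messages (approximate)
ABNORMAL_P90_SECONDS = 0.4    # ~0.4 s p90 latency (approximate)

# (cost per 1M messages in USD, latency multiple vs. Abnormal), from the table.
TABLE = {
    "Opus 4.8": (130_093, 34),
    "Opus 4.6": (105_354, 33),
    "Haiku 4.5": (21_699, 19),
    "Sonnet 4.5": (79_929, 79),
    "GPT-4.1": (29_200, 20),
    "GPT-5 Mini": (6_342, 40),
}


def derive(table):
    """Return {model: (cost multiple vs. Abnormal, approx. p90 latency in seconds)}."""
    return {
        name: (cost / ABNORMAL_COST_PER_1M, multiple * ABNORMAL_P90_SECONDS)
        for name, (cost, multiple) in table.items()
    }


derived = derive(TABLE)
for name, (cost_multiple, latency_s) in derived.items():
    print(f"{name:<11} {cost_multiple:>6,.0f}x cost   ~{latency_s:.1f} s p90")
```

The cheapest model works out to about 317× and the most expensive to about 6,505×, which the summary rounds to 300–6,500×. The latency multiples of 19× and 79× correspond to about 7.6 s and 31.6 s.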

04 — What the Numbers Mean

Read the efficacy results as indicative. On this focused set, no single-pass model matched the system's balance of catching attacks while leaving legitimate mail alone—the strongest flagged about half the attacks, and every model over-flagged a large share of safe mail.

That shortfall comes at roughly 300–6,500× the cost and 19–79× the latency of the Abnormal detection system. The cost and latency gaps are robust to prompt and configuration; the efficacy comparison reflects this evaluation set, not production performance.

05 — How to Read These Results

5.1 A Focused Evaluation Set

This is a focused evaluation set, not representative inbox traffic. In production, traffic distribution shifts constantly—every environment sees a different mix—so no single benchmark number generalizes. We treat the efficacy comparison as indicative and validate it separately.
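
One reason a balanced set does not generalize is the base rate: the share of flags that are real attacks depends on how common attacks are in the traffic. The sketch below applies Bayes' rule to one row of the results table (Opus 4.8: 53.9% recall, 53.8% false-positive rate). The 1% prevalence is a hypothetical value chosen for illustration, not a measured production rate.

```python
def precision_at_prevalence(recall, false_positive_rate, prevalence):
    """Share of ATTACK flags that are true attacks at a given attack prevalence."""
    true_flags = recall * prevalence
    false_flags = false_positive_rate * (1.0 - prevalence)
    if true_flags + false_flags == 0:
        raise ValueError("no messages are flagged at these rates")
    return true_flags / (true_flags + false_flags)


# Opus 4.8 row from the results table.
RECALL, FPR = 0.539, 0.538

balanced = precision_at_prevalence(RECALL, FPR, 0.50)     # the 1,000 / 1,000 test sets
hypothetical = precision_at_prevalence(RECALL, FPR, 0.01)  # assumed 1% attack share

print(f"precision at 50% prevalence: {balanced:.3f}")
print(f"precision at 1% prevalence:  {hypothetical:.3f}")
```

The same recall and false-positive rate give very different precision as prevalence changes, which is why the efficacy figures are treated as indicative rather than as production numbers.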

5.2 Configuration Matters

Each model was run as a single-pass classifier under detection-oriented prompting. Results depend on prompt and configuration and would differ under other setups.

5.3 Robust vs. Set-Specific

The cost and latency gaps are large and insensitive to prompt and dataset choices, so they are the findings to lean on. The efficacy comparison reflects this evaluation set; we validate it on representative production traffic as a separate, continuous effort.

5.4 Variance & Contamination

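Efficacy figures are computed over 1,000-message sets per class, so each percentage carries a sampling error of roughly three percentage points or less at 95% confidence. Because evaluation messages are real production mail rather than a published benchmark, the risk of overlap with model training data is low.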
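
To put a rough bound on the variance of a rate measured over 1,000 messages, one option is a Wilson score interval. The sketch below is illustrative and is not a method the paper states it used. It assumes messages are independent draws, and the counts are derived from the table percentages (53.9% of 1,000 is 539).

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Recall counts implied by the table: 53.9% and 4.5% of 1,000 attacks.
for name, flagged in [("Opus 4.8", 539), ("GPT-5 Mini", 45)]:
    low, high = wilson_interval(flagged, 1000)
    print(f"{name:<11} recall {flagged / 1000:.1%}  95% CI [{low:.1%}, {high:.1%}]")
```

Near 50%, the interval spans about three percentage points either side of the estimate. The gaps between the strongest and weakest models are far larger than that, while differences of a few points between adjacent rows are within noise.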

06 — Conclusion

On this focused evaluation set, no frontier model run as a single-pass classifier matched the Abnormal detection system's balance of catching attacks and leaving legitimate mail alone. That result reflects this evaluation set—including the system's ~99% figures—which we validate separately on representative production traffic.

The findings that do travel are the structural ones. A frontier model used as a drop-in classifier costs roughly two to four orders of magnitude more to run and returns a verdict tens of times slower. Those gaps are large, consistent, and insensitive to configuration. So, can a single frontier-model call replace the system? Not as a single-pass swap: cost and latency alone rule it out, and on detection the system holds its balance against every model on this set.

Appendix — Glossary

  • Single-pass classifier — A model given one message and asked for one verdict, with no surrounding pipeline.
  • Recall (attacks flagged) — Of confirmed attacks, the share a system labels as an attack.
  • Clean-pass rate — Of confirmed-safe mail, the share a model labeled SAFE.
  • False-positive rate — Of confirmed-safe mail, the share a model flagged as an ATTACK.
  • Operating point — A single (recall, precision) position a classifier occupies at a given threshold or prompt.
  • p90 — The 90th-percentile value; 90% of runs were at or below it. Used here for cost and latency.
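
The p90 definition above can be made concrete with a nearest-rank percentile. This is one common convention. The paper does not say which percentile method it used, and the latency samples below are invented for illustration.

```python
def p90(values):
    """Nearest-rank 90th percentile: smallest value with >= 90% of runs at or below it."""
    if not values:
        raise ValueError("values is empty")
    ordered = sorted(values)
    # Integer ceiling of 0.9 * n, avoiding floating-point rounding.
    rank = -(-9 * len(ordered) // 10)
    return ordered[rank - 1]


# Invented per-message latencies in seconds, for illustration only.
latencies = [0.21, 0.25, 0.26, 0.28, 0.30, 0.31, 0.33, 0.35, 0.40, 1.90]
print(p90(latencies))
```

With ten samples the p90 is the ninth-smallest value, so the single slow outlier does not move it. Reporting p90 rather than the mean reflects typical worst-case behavior without being dominated by rare extremes.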


Based on Abnormal internal benchmarking, March 1 to June 16, 2026. We compared Abnormal's purpose-built behavioral detection system with six general-purpose large language models (Claude Opus 4.8, Claude Opus 4.6, Claude Sonnet 4.5, Claude Haiku 4.5, GPT-4.1, and GPT-5 Mini), each run as a single-pass email classifier under detection-oriented prompting. The models were evaluated against two balanced sets of real production messages: 1,000 confirmed attacks and 1,000 confirmed-safe samples. Corpus design was reviewed by the Abnormal Detection team, and results were scored against expert human review by Abnormal threat analysts, whose judgments are the reference standard throughout. Detection, cost, and speed were selected as the primary measures of security effectiveness and operational efficiency for enterprise email threat detection. Cost and latency figures are p90 measurements from the runs and are not sensitive to dataset composition, prompt, or configuration. Detection figures reflect a focused evaluation set and are illustrative of this test rather than a guarantee of production results, which vary by environment and over time. LLM results depend on prompt and configuration; multiples are relative to the specific models and settings shown. Claude and GPT are trademarks of their respective owners; comparisons are for informational purposes.
