Benchmark · v1.0

Published 2026-05-23

Next refresh 2026-08-15

OWASP Agentic Top-10 — Commercial Platform Benchmark

10 commercial AI-agent platforms scored against the OWASP Agentic Top-10 by an independent operator. Methodology, repro repo, dispute channel, and quarterly refresh commitment. Built on agent-audit-kit telemetry + public-docs review.

v1.0 scores are working estimates — each cell links to a repro recipe in the repo. Vendor pushback welcome via the dispute channel below. Bands (not absolutes) are what matter.

Methodology + repro repo See agent-audit-kit (the scanner behind this)

Scorecard

Each cell: 0–3. Total per platform: out of 30. Hover a cell to see the one-line justification. Sorted by total descending.

Platform	A1	A2	A3	A4	A5	A6	A7	A8	A9	A10	Total
Claude Managed Agents Anthropic · Managed Agent	2	3	2	2	2	2	2	2	2	3	22
Amp Sourcegraph (spinout) · Agent IDE	2	2	2	1	2	2	2	2	1	2	18
Cursor Agent Anysphere · Agent IDE	2	2	1	1	2	2	2	1	1	2	16
Windsurf Codeium · Agent IDE	2	2	1	1	2	2	1	2	1	2	16
Augment Code Augment Computing · Agent IDE	2	2	1	1	2	2	1	2	1	2	16
Devin Cognition Labs · Managed Agent	1	2	1	1	1	1	2	1	1	2	13
Continue.dev Continue · Agent IDE	1	2	1	1	2	1	1	1	1	2	13
Cline Cline · Agent IDE	1	2	1	1	2	1	1	1	1	1	12
Bolt.new StackBlitz · Managed Agent	1	2	1	0	1	1	1	1	0	2	10
Replit Agent Replit · Managed Agent	1	1	1	0	1	1	1	1	0	2	9

0No coverage

1Partial / docs only

2Default-on, escape hatches

3Default-on, audited

Platform commentary

Claude Managed Agents

Anthropic · Managed Agent

22/30

Strongest overall posture today. Investments in tool-call governance and audit-trail surfaces are visible. Per-call capability leases are the gap.

Amp

Sourcegraph (spinout) · Agent IDE

18/30

Multi-agent decomposition is structurally the closest to a per-call capability-lease pattern in the IDE class. Weakest on grounding enforcement.

Cursor Agent

Anysphere · Agent IDE

16/30

IDE-class agent. Human-in-the-loop diff review is the load-bearing safety surface. Excellent for code; capability-lease story underdeveloped.

Windsurf

Codeium · Agent IDE

16/30

Enterprise-flavoured IDE class. Strong defaults on disclosure + identity; capability-lease story still per-feature, not per-call.

Augment Code

Augment Computing · Agent IDE

16/30

Enterprise-tier IDE assistant. Strong on identity + isolation. Capability-lease pattern still implicit.

Devin

Cognition Labs · Managed Agent

13/30

Sandbox is the load-bearing control. Strong on isolation, weaker on per-call capability leasing and indirect-injection.

Continue.dev

Continue · Agent IDE

13/30

Open-source posture pushes responsibility to the operator. Excellent transparency. Weakest where corporate procurement wants default-on controls.

Cline

Cline · Agent IDE

12/30

MCP-native open-source agent. Power-user-friendly. Lowest default-on protection — by design.

Bolt.new

StackBlitz · Managed Agent

10/30

Different category — generation, not execution. Lower-stakes surface, lower scores on disclosure-class risks.

Replit Agent

Replit · Managed Agent

9/30

Optimized for ship-fast prototyping. Lower-stakes surface; lower default-on protection. Capability lease basically absent.

OWASP Agentic Top-10 reference

A1 — Prompt Injection: Direct + indirect injection across user-controlled inputs, retrieved context, and tool outputs.
A2 — Sensitive Info Disclosure: Agent leaks secrets, PII, or training-set artifacts through outputs or tool calls.
A3 — Supply-Chain Vulnerabilities: Compromised models, MCP servers, plugins, or dependency graphs.
A4 — Data + Model Poisoning: Adversarial inputs into fine-tuning, RAG corpus, eval set, or memory store.
A5 — Improper Output Handling: Downstream system trusts agent output without validation (SQL, shell, browser, network).
A6 — Excessive Agency: Overbroad tool inventories, long-lived keys, no allow-listing, no human-in-the-loop on irreversible ops.
A7 — System Prompt Leakage: System prompt exfiltration via user input, function-calling, or memory injection.
A8 — Vector + Embedding Weakness: Embedding-store poisoning, retrieval-based prompt injection, multi-tenant leakage.
A9 — Misinformation: Confident hallucinations propagated into downstream systems without grounding or citation.
A10 — Unbounded Consumption: Cost / latency / token attacks; agent loops; budget-router absence.

Authoritative source: OWASP Top-10 for LLM Applications + Generative AI (2026).

Methodology

Inputs (in priority order): (1) Live agent-audit-kit scan against publicly-available SDK / OSS components of each platform. (2) Vendor public documentation, blog posts, and security pages. (3) Reproducible behavioural probes — small adversarial inputs that test the documented protection — repro scripts live in the repo. (4) Conversations with security engineers at vendors (when willing).

What I do NOT score on:private red-team results, non-reproducible anecdotes, or marketing slides. If you can't reproduce a claim from a repro recipe, it doesn't move the score.

Scoring band rationale:bands (0/1/2/3) not decimals. Decimals invite false precision and rank-quibbling. Bands force "is this default-on or not."

Refresh cadence: Quarterly. Each release ships with a changelog showing which scores moved and why. v1.0 published 2026-05-23. v1.1 due 2026-08-15.

Disputes

If you work on one of these platforms and disagree with a score, here's the channel:

Open a GitHub issue on the benchmark repo with the platform + family + your evidence.
I run your repro recipe. If it changes the band, the score moves in the next quarterly refresh — with attribution.
Vendor-marketed claims don't move scores. Repro recipes do.

Email me directly

Cite this

If you reference scores in writing, please cite as:

Sattyam Jain. "OWASP Agentic Top-10 — Commercial Platform Benchmark
v1.0." sattyamjjain.in, 2026-05-23.
https://www.sattyamjjain.in/benchmark/owasp-agentic-2026