Strongest overall posture today. Investments in tool-call governance and audit-trail surfaces are visible. Per-call capability leases are the gap.
OWASP Agentic Top-10 — Commercial Platform Benchmark
10 commercial AI-agent platforms scored against the OWASP Agentic Top-10 by an independent operator. Methodology, repro repo, dispute channel, and quarterly refresh commitment. Built on agent-audit-kit telemetry + public-docs review.
Scorecard
Each cell: 0–3. Total per platform: out of 30. Hover a cell to see the one-line justification. Sorted by total descending.
| Platform | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Managed Agents Anthropic · Managed Agent | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 22 |
| Amp Sourcegraph (spinout) · Agent IDE | 2 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 18 |
| Cursor Agent Anysphere · Agent IDE | 2 | 2 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 2 | 16 |
| Windsurf Codeium · Agent IDE | 2 | 2 | 1 | 1 | 2 | 2 | 1 | 2 | 1 | 2 | 16 |
| Augment Code Augment Computing · Agent IDE | 2 | 2 | 1 | 1 | 2 | 2 | 1 | 2 | 1 | 2 | 16 |
| Devin Cognition Labs · Managed Agent | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 | 13 |
| Continue.dev Continue · Agent IDE | 1 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 13 |
| Cline Cline · Agent IDE | 1 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 12 |
| Bolt.new StackBlitz · Managed Agent | 1 | 2 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 2 | 10 |
| Replit Agent Replit · Managed Agent | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 2 | 9 |
Platform commentary
Multi-agent decomposition is structurally the closest to a per-call capability-lease pattern in the IDE class. Weakest on grounding enforcement.
IDE-class agent. Human-in-the-loop diff review is the load-bearing safety surface. Excellent for code; capability-lease story underdeveloped.
Enterprise-flavoured IDE class. Strong defaults on disclosure + identity; capability-lease story still per-feature, not per-call.
Enterprise-tier IDE assistant. Strong on identity + isolation. Capability-lease pattern still implicit.
Sandbox is the load-bearing control. Strong on isolation, weaker on per-call capability leasing and indirect-injection.
Open-source posture pushes responsibility to the operator. Excellent transparency. Weakest where corporate procurement wants default-on controls.
MCP-native open-source agent. Power-user-friendly. Lowest default-on protection — by design.
Different category — generation, not execution. Lower-stakes surface, lower scores on disclosure-class risks.
Optimized for ship-fast prototyping. Lower-stakes surface; lower default-on protection. Capability lease basically absent.
OWASP Agentic Top-10 reference
- A1 — Prompt Injection
- Direct + indirect injection across user-controlled inputs, retrieved context, and tool outputs.
- A2 — Sensitive Info Disclosure
- Agent leaks secrets, PII, or training-set artifacts through outputs or tool calls.
- A3 — Supply-Chain Vulnerabilities
- Compromised models, MCP servers, plugins, or dependency graphs.
- A4 — Data + Model Poisoning
- Adversarial inputs into fine-tuning, RAG corpus, eval set, or memory store.
- A5 — Improper Output Handling
- Downstream system trusts agent output without validation (SQL, shell, browser, network).
- A6 — Excessive Agency
- Overbroad tool inventories, long-lived keys, no allow-listing, no human-in-the-loop on irreversible ops.
- A7 — System Prompt Leakage
- System prompt exfiltration via user input, function-calling, or memory injection.
- A8 — Vector + Embedding Weakness
- Embedding-store poisoning, retrieval-based prompt injection, multi-tenant leakage.
- A9 — Misinformation
- Confident hallucinations propagated into downstream systems without grounding or citation.
- A10 — Unbounded Consumption
- Cost / latency / token attacks; agent loops; budget-router absence.
Authoritative source: OWASP Top-10 for LLM Applications + Generative AI (2026).
Methodology
Inputs (in priority order): (1) Live agent-audit-kit scan against publicly-available SDK / OSS components of each platform. (2) Vendor public documentation, blog posts, and security pages. (3) Reproducible behavioural probes — small adversarial inputs that test the documented protection — repro scripts live in the repo. (4) Conversations with security engineers at vendors (when willing).
What I do NOT score on:private red-team results, non-reproducible anecdotes, or marketing slides. If you can't reproduce a claim from a repro recipe, it doesn't move the score.
Scoring band rationale:bands (0/1/2/3) not decimals. Decimals invite false precision and rank-quibbling. Bands force "is this default-on or not."
Refresh cadence: Quarterly. Each release ships with a changelog showing which scores moved and why. v1.0 published 2026-05-23. v1.1 due 2026-08-15.
Disputes
If you work on one of these platforms and disagree with a score, here's the channel:
- Open a GitHub issue on the benchmark repo with the platform + family + your evidence.
- I run your repro recipe. If it changes the band, the score moves in the next quarterly refresh — with attribution.
- Vendor-marketed claims don't move scores. Repro recipes do.
Cite this
If you reference scores in writing, please cite as:
Sattyam Jain. "OWASP Agentic Top-10 — Commercial Platform Benchmark
v1.0." sattyamjjain.in, 2026-05-23.
https://www.sattyamjjain.in/benchmark/owasp-agentic-2026