Security · AI · Benchmark · XSS · Evaluation

Evaluating AI Agents in Identifying Web Vulnerabilities

Benchmarking 7 leading AI models on XSS detection across 5 difficulty levels — measuring detection accuracy, token efficiency, speed, and tool call behavior.


You can't improve what you don't measure. — often attributed to Peter Drucker

In this first article of our series, we share research and testing results to answer a key question: Can large language models help improve automated security testing?

We benchmarked leading AI models on a classic web vulnerability: Cross-Site Scripting (XSS). Tests covered five levels of difficulty, from basic reflected XSS to complex DOM manipulation that triggers JavaScript execution.
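To make the task concrete, here is a minimal sketch of what detecting a basic reflected XSS boils down to: inject a marker payload and check whether the page echoes it back unescaped. All names here are illustrative, not part of the benchmark harness.

```python
import html

# Hypothetical reflected-XSS check: a response is flagged when the raw
# payload string comes back unescaped in the body.
PAYLOAD = "<script>alert(1)</script>"

def is_reflected_xss(response_body: str, payload: str = PAYLOAD) -> bool:
    """True if the payload is echoed back verbatim (unescaped)."""
    return payload in response_body

def safe_render(user_input: str) -> str:
    """A correctly escaped page, for contrast."""
    return f"<p>You searched for: {html.escape(user_input)}</p>"
```

A page that escapes its output (as `safe_render` does) would not be flagged, while one that interpolates user input verbatim would be. The harder DOM-based levels require executing JavaScript rather than string matching, which is where agentic tool use comes in.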

  • 7 models tested across 3 AI providers
  • 5 XSS levels of increasing difficulty
  • 100% best detection rate: Claude Sonnet and Gemini Pro/Flash
  • ~40K tokens per test for the most efficient model, Gemini Flash

Why Evaluate

Evaluations (or evals) are essential for any AI agent system. They measure how well a system performs a task and help answer questions like:

  • "How good is my agent at this task?"
  • "Will a new tool improve or reduce my agent's performance?"

Without a structured eval, model upgrades, prompt changes, and tooling decisions become guesswork. Evals turn intuition into measurement.
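In its simplest form, an eval is just a loop over a fixed task set that reduces pass/fail outcomes to a single score. The sketch below is illustrative of the idea, not our actual harness:

```python
from typing import Callable

# Minimal eval-harness sketch: run an agent over a fixed task set and
# reduce its per-task verdicts to one comparable score.
def detection_rate(agent: Callable[[str], bool], tasks: list[str]) -> float:
    """Fraction of tasks on which the agent correctly flags a vulnerability."""
    verdicts = [agent(task) for task in tasks]
    return sum(verdicts) / len(verdicts)
```

Because the task set and scoring are fixed, two model or prompt variants can be compared on the same number rather than on anecdotes.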


Methodology

We conducted the benchmark using the XSS Game challenge, which has five levels of increasing difficulty for detecting XSS vulnerabilities. Each model was tested on all levels, and we collected the following metrics:

  • Vulnerability Detection Rate — percentage of tests where vulnerabilities were correctly identified
  • Token Usage — total tokens across all tests and average tokens per test (input + output)
  • Task Duration — time taken to complete each test
  • Tool Calls — number of tool invocations (HTTP requests) made during testing

Our agents used a customized version of the ReAct pattern to structure their reasoning and approach.
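The ReAct pattern interleaves model reasoning with tool use in a Thought → Action → Observation loop. The sketch below stubs out the model call and the tool set; every name in it is illustrative:

```python
# Sketch of a ReAct-style loop (Thought -> Action -> Observation).
# The model call and tool set are stubbed; all names are illustrative.
def react_loop(llm, tools, task, max_steps=10):
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        thought, action, arg = llm(transcript)      # model picks the next step
        transcript += f"\nThought: {thought}\nAction: {action}({arg})"
        if action == "finish":                      # agent reports its verdict
            return arg
        observation = tools[action](arg)            # e.g. an HTTP request
        transcript += f"\nObservation: {observation}"
    return None
```

Each observation is appended to the transcript, so the model's next "thought" can react to what the previous tool call returned — this is what lets an agent probe a page, inspect the echo, and conclude.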

Models tested:

Model          Provider   Tier
Claude Sonnet  Anthropic  Frontier
Claude Haiku   Anthropic  Efficient
GPT-5.1        OpenAI     Frontier
GPT-5-mini     OpenAI     Efficient
GPT-5-nano     OpenAI     Nano
Gemini Pro     Google     Frontier
Gemini Flash   Google     Efficient

Results

Vulnerability Detection Rate

The vulnerability detection rate measures the percentage of tests in which the agent successfully identified XSS vulnerabilities. This metric directly reflects an agent's effectiveness in real-world security testing scenarios.

[Chart: Vulnerability detection rate by AI model]

  • Gemini Pro, Gemini Flash, and Claude Sonnet all achieved 100% accuracy, closely followed by GPT-5.1 and Claude Haiku.
  • Smaller OpenAI models such as GPT-5-mini (60%) and GPT-5-nano (20%) showed lower accuracy.

This demonstrates a clear tradeoff between performance and resource usage, highlighting which models are suitable for large-scale automated testing.


Cost Analysis: Token Usage

Token usage directly impacts API costs. We measured both total token consumption across all tests and average tokens per test to identify the most cost-effective model for security testing.
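Token counts translate to dollars through the provider's per-million-token rates for input and output. A back-of-the-envelope helper (the prices in the example are placeholders, not any provider's actual rates):

```python
# Back-of-the-envelope API cost per test. Rates are expressed in dollars
# per million tokens; the values used below are placeholders.
def cost_per_test(input_tokens: int, output_tokens: int,
                  usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    return (input_tokens * usd_per_mtok_in
            + output_tokens * usd_per_mtok_out) / 1_000_000
```

Because output tokens are typically priced several times higher than input tokens, two models with similar total token counts can still differ noticeably in cost.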

[Chart: Token usage per model across all XSS levels]

Gemini variants offered the best balance of performance and cost, forming the Pareto frontier for efficiency. Anthropic models, such as Sonnet, were roughly 6× more expensive per sample than Gemini models while delivering similar detection performance.


Task Duration

Task duration measures the time an agent takes to complete each test. Faster models allow more comprehensive testing within tight schedules, increasing overall assessment throughput.

[Chart: Task duration per model across XSS levels]

As expected, smaller (and cheaper) models generally showed lower request latency and shorter wall-clock task durations than their larger family counterparts, but Gemini Flash's combination of speed and accuracy stood out: it completed tasks fastest, averaging roughly 160 seconds, while also making fewer tool calls per run (see the Tool Call Efficiency section below).


Tool Call Efficiency

Tool calls represent the agent's interactions with testing tools (e.g., HTTP requests). Fewer tool calls may indicate more efficient reasoning, while more tool calls might suggest thorough exploration of attack vectors.

[Chart: Average tool calls per test by model]

Claude variants averaged ~15 tool calls per test, whereas Gemini variants averaged only ~6. This demonstrates that Gemini agents can achieve high precision while minimizing unnecessary tool usage, saving both time and resources.
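One simple way to collect this metric is to wrap each tool so every invocation increments a shared counter before delegating to the real tool. This is a sketch of the idea, not our production harness:

```python
import functools

# Count tool calls by wrapping each tool: every invocation bumps a shared
# counter, then delegates to the real tool (illustrative sketch).
def counted(tool, counter: dict):
    @functools.wraps(tool)
    def wrapper(*args, **kwargs):
        counter["calls"] = counter.get("calls", 0) + 1
        return tool(*args, **kwargs)
    return wrapper
```

The agent sees the wrapped tool as identical to the original, so instrumentation does not change its behavior.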


Accuracy vs. Cost Tradeoff

This analysis compares accuracy (vulnerability detection rate) with cost (average tokens per test) to help teams select the optimal model based on budget and performance needs.

[Chart: Accuracy vs. cost tradeoff scatter chart]
  • Models in the upper-left quadrant, such as Gemini variants and GPT-5.1, offer the best value — combining high accuracy with low cost.
  • Claude models achieve high accuracy but at a significantly higher cost.
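"Best value" here means Pareto-optimal: no other model is at least as accurate and at least as cheap with a strict improvement in one of the two. A small sketch of that selection rule (the sample numbers in the usage below are hypothetical, not the benchmark's figures):

```python
# Pareto-frontier sketch over (accuracy, cost): a model is kept when no
# other model is at least as accurate AND at least as cheap, with at
# least one strict improvement.
def pareto_frontier(models):
    """models: list of (name, accuracy, avg_tokens) tuples."""
    front = []
    for name, acc, cost in models:
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for n, a, c in models if n != name
        )
        if not dominated:
            front.append(name)
    return front
```

With hypothetical inputs like `[("A", 1.0, 40_000), ("B", 1.0, 240_000), ("C", 0.2, 20_000)]`, "B" is dominated by the equally accurate but cheaper "A", while the cheap-but-inaccurate "C" stays on the frontier.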

Tool Calls vs. Accuracy Tradeoff

This chart illustrates the relationship between tool call efficiency and accuracy. This metric is important for understanding how effectively each model reasons and plans its testing strategy — fewer tool calls often indicate better problem-solving capabilities and more strategic thinking.

[Chart: Tool calls vs. accuracy tradeoff scatter chart]

Models with high accuracy and few tool calls — like GPT-5.1 and Gemini variants — are the most efficient, solving tasks with minimal tool interactions. Anthropic models required more tool calls to achieve similar accuracy, indicating less efficient task planning.


Conclusion

Our tests show clear differences between models depending on what matters most.

  • For pure detection performance: Claude Sonnet and Gemini Pro/Flash reached the top with a 100% detection rate.
  • For efficiency: Gemini Flash used the fewest resources, averaging about 40K tokens per test, and was also the fastest model overall.

These results show that the best model depends on your priority: maximum accuracy, lowest cost, or highest speed. They also confirm that modern AI agents can already handle complex vulnerability testing tasks in a reliable and measurable way.


Early Access — Try What We Built

We turned these insights into a platform designed to remove the limits of traditional security testing. It uses optimized agents, fast execution, and smart resource use to deliver accurate results in a fraction of the time.

Join our waiting list to be among the first companies to test it and see how automated AI testing can simplify your security workflow while saving time and cost.


Hypersec Research Team


Information Security Research