How Much Data Do You Need to Measure AI Visibility with Confidence?
Recent research from SparkToro and Gumshoe examined how consistent AI tools are when recommending brands and products. We found that individual AI responses are highly variable (you'll rarely see the same answers twice), but the frequency with which a brand appears across many queries converges to a stable signal. Visibility, measured as the proportion of responses in which a brand appears, behaves like a well-defined probability that can be estimated with standard statistical tools.
This raises a practical question: how many queries do you need before that estimate is reliable?
One prompt, one model
Start with the simplest possible setup. You have a single prompt, say "What are the best noise-cancelling headphones?", and you're running it against a single AI model. Each time you run it, a given brand either appears in the response or it doesn't. That's a Bernoulli trial with some unknown probability p, and you're trying to estimate p.
After n runs, the sample proportion p̂ has standard error:
SE = sqrt(p(1-p) / n)
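If you want to sanity-check that formula, a few lines of simulation will do it. This is a throwaway sketch, not part of any product: the values p = 0.3 and n = 100 are made up for illustration, and the `simulate_se` helper is ours. It draws n Bernoulli trials many times over and compares the spread of the sample proportion against the analytic standard error.

```python
import math
import random

def simulate_se(p=0.3, n=100, trials=10_000, seed=1):
    """Compare the empirical spread of the sample proportion
    against the analytic standard error sqrt(p * (1 - p) / n)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        hits = sum(rng.random() < p for _ in range(n))  # brand appeared?
        estimates.append(hits / n)                      # sample proportion
    mean = sum(estimates) / trials
    empirical = math.sqrt(sum((x - mean) ** 2 for x in estimates) / trials)
    analytic = math.sqrt(p * (1 - p) / n)
    return empirical, analytic

print(simulate_se())  # both values land near 0.046
```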
The margin of error for a 95% confidence interval is 1.96·SE. We don't know p in advance, but the variance is maximized at p = 0.5, so we can get a conservative bound by assuming the worst case:
MoE = 1.96 · 0.5 / sqrt(n) = 0.98 / sqrt(n)
Solving for n:
n ≥ (0.98 / MoE)²
For a single prompt on a single model, you need about 100 queries to get within ±10 percentage points, and about 400 for ±5 pp. This is consistent with the SparkToro research, which ran prompts 60–100 times and found stable visibility patterns at that scale.
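Turning that bound into a sample-size calculator is a one-liner. A minimal sketch (the `required_runs` helper is purely illustrative, not Gumshoe tooling):

```python
import math

def required_runs(moe, z=1.96, p=0.5):
    """Worst-case (p = 0.5) number of runs for a target margin of
    error at the z-level of confidence (1.96 -> 95%)."""
    return math.ceil((z * math.sqrt(p * (1 - p)) / moe) ** 2)

print(required_runs(0.10))  # 97  -> "about 100" runs for +/-10 pp
print(required_runs(0.05))  # 385 -> "about 400" runs for +/-5 pp
```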
Adding associated prompts
In practice, you don't care about visibility for one exact prompt. You care about a topic area: not just "best noise-cancelling headphones" but "best headphones for travel," "top wireless earbuds for commuting," and so on. If a brand is visible for the underlying topic, it should show up across related prompts, and each associated prompt gives you additional observations of the same underlying probability.
With k associated prompts, each run produces k observations instead of 1. The margin of error becomes:
MoE = 0.98 / sqrt(k · m)
where m is the number of times you run each prompt. More associated prompts mean fewer repetitions of each are needed to reach the same precision.
This is where the sample size problem becomes tractable. Ten related prompts run 39 times each, or fifty prompts run 8 times each: either gets you to ±5 pp. The total number of queries is similar either way (~390 vs. ~400), but spreading them across prompts has the advantage of estimating visibility for the whole topic rather than for a single phrasing.
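Here's the same worst-case bound, rearranged to answer "how many times do I run each of my k prompts?" (again an illustrative sketch, not production code):

```python
import math

def runs_per_prompt(k, moe, z=1.96):
    """Repetitions of each of k associated prompts needed so the
    pooled worst-case margin of error is at most `moe`."""
    total_obs = (z * 0.5 / moe) ** 2   # observations needed in total
    return math.ceil(total_obs / k)    # spread evenly across k prompts

print(runs_per_prompt(10, 0.05))  # 39 runs each, ~390 queries total
print(runs_per_prompt(50, 0.05))  # 8 runs each,  ~400 queries total
```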
Different wordings of the same prompt
A related axis of variation: instead of different prompts about the same topic, you can vary how a single prompt is phrased. "Best noise-cancelling headphones" asked by a frequent traveler might be worded differently from the same question from an audiophile or a college student. These variations serve a similar statistical function (each is an independent observation), but they also capture something different in substance. They tell you whether visibility is robust across the ways real people ask the question, rather than being an artifact of one particular phrasing.
If you have k associated prompts and j phrasings of each, a single pass produces k · j observations:
MoE = 0.98 / sqrt(k · j · m)
With 10 prompts and 7 phrasings, you get 70 observations per pass. At that density, a single pass gets you to about ±12 pp, and four passes get you to about ±6 pp.
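The same worst-case arithmetic, sketched in code (the helper name and example numbers are just for illustration):

```python
import math

def moe_per_pass(k, j, m=1, z=1.96):
    """Worst-case margin of error after m passes over k associated
    prompts, each asked in j different phrasings."""
    return z * 0.5 / math.sqrt(k * j * m)

print(round(moe_per_pass(10, 7, m=1), 3))  # 0.117 -> about +/-12 pp
print(round(moe_per_pass(10, 7, m=4), 3))  # 0.059 -> about +/-6 pp
```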
Multiple models
The final dimension is querying across AI models: ChatGPT, Claude, Gemini, Perplexity, and so on. Each model's response is an additional observation, so adding q models multiplies your sample size:
MoE = 0.98 / sqrt(k · j · q · m)
This assumes you're estimating a single "AI visibility" number across all models. With 10 prompts, 7 phrasings, and 7 models, a single pass generates 490 observations.
A single pass is enough for ±5 pp. The sample size problem that looked daunting for a single prompt on a single model becomes trivial when you're working across a realistic set of prompts, phrasings, and models.
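The pooled calculation across all three dimensions, as a sketch (illustrative helper, same worst-case p = 0.5 assumption as above):

```python
import math

def pooled_moe(k, j, q, m=1, z=1.96):
    """Worst-case margin of error when k prompts, j phrasings,
    q models, and m passes are pooled into one visibility estimate."""
    return z * 0.5 / math.sqrt(k * j * q * m)

# 10 prompts x 7 phrasings x 7 models, a single pass:
print(round(pooled_moe(10, 7, 7), 3))  # 0.044 -> comfortably inside +/-5 pp
```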
Slicing back down
The tradeoff is obvious: aggregating across dimensions buys you precision on the overall estimate, but the moment you want to slice by a specific dimension (visibility on a particular model, or for a particular prompt) you're back to a smaller effective sample.
Take a configuration of 10 prompts, 8 phrasings, and 10 models, or 800 observations per pass (the same shape as the default report described below). Each phrasing contributes k · q = 100 observations per pass, good for about ±10 pp; each prompt contributes j · q = 80, about ±11 pp; each model contributes k · j = 80, also about ±11 pp.
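To make the tradeoff concrete, here's the per-slice arithmetic for that configuration (a sketch; `slice_moe` is just the worst-case formula applied to whatever subset of observations you keep):

```python
import math

def slice_moe(observations, z=1.96):
    """Worst-case margin of error for a given slice of the data."""
    return z * 0.5 / math.sqrt(observations)

k, j, q = 10, 8, 10  # prompts, phrasings, models; one pass
print(round(slice_moe(k * j * q), 3))  # pooled estimate: 0.035
print(round(slice_moe(k * q), 3))      # one phrasing:    0.098
print(round(slice_moe(j * q), 3))      # one prompt:      0.110
print(round(slice_moe(k * j), 3))      # one model:       0.110
```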
How Gumshoe structures its reports
The math above informs how we designed Gumshoe's reporting system. A Gumshoe report is built around the three dimensions described in this post: a set of topics (associated prompts covering the category you care about), a set of personas (varied phrasings reflecting different user types), and a set of models. We call a single evaluation of all topic-persona-model combinations a "report run."
The default configuration (10 topics, 8 personas, 10 models) produces 800 observations per run. At that density, a single run yields an overall visibility estimate within ±5 pp at 95% confidence, which is precise enough to answer the most common question: how visible is my brand across AI tools? Running the report twice in quick succession tightens that to roughly ±2.5 pp, which is sufficient for tracking changes over time. If your visibility moves by more than a few points between measurement periods, you can be confident the change is real.
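The run-count arithmetic behind those numbers, as a sketch (the `runs_needed` helper is illustrative, not part of the Gumshoe product):

```python
import math

def runs_needed(target_moe, obs_per_run=800, z=1.96):
    """Report runs needed, at 800 observations per run, to bring the
    worst-case margin of error down to `target_moe`."""
    return math.ceil(((z * 0.5 / target_moe) ** 2) / obs_per_run)

print(runs_needed(0.05))   # 1 run already clears +/-5 pp
print(runs_needed(0.025))  # 2 runs for roughly +/-2.5 pp
```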
Per-topic and per-persona breakdowns are where we're more careful about communicating uncertainty. A single run gives directional signal: you can see which topics are strong and which are weak. But the per-group confidence intervals are wide enough that small differences between topics shouldn't be over-interpreted. We recommend multiple runs for users who need to make decisions at the topic level, and we surface the effective sample size so users can judge for themselves.
At Gumshoe, a typical report run returns information from 800 conversations with models. For most use cases, a single run provides sufficient directional insight into overall AI visibility. When higher precision is required, teams can run additional reports to increase confidence, while still spending a small fraction of what brands typically invest in traditional search analytics.
We think this is the right way to approach AI visibility measurement: be transparent about what the numbers can and can't tell you at a given sample size, and make it easy to collect more data when the question demands it.
Nick Clark · Gumshoe · April 2025