Operating System · Last verified: MAY 2026

Chapter 20 — Prompt Test Set

Definition

A prompt test set is a standardized library of 25 ecommerce-relevant prompts, locked in structure, run weekly across all five major AI commerce surfaces — ChatGPT, Perplexity, Gemini, Google AI Mode, and Microsoft Copilot — to track the brand’s mention, citation, and position over time. Same 25 prompts every week. Same five engines every week. Same locked structure for at least a quarter. Without this floor, every other measurement chapter in this part of the encyclopedia produces noise instead of signal.


Why it matters

You can’t optimize what you can’t measure. You can’t measure AI search reliably if every check uses different prompts, on different engines, on different days.

Three structural facts force the discipline:

1. AI engine outputs are non-deterministic. The same prompt typed into ChatGPT twice can return different sources, different brand mentions, different ordering. Search Engine Land’s analysis of the Semrush AI Visibility Index found that across 2,500 tracked prompts on Google AI Mode and ChatGPT, between 40 and 60 percent of cited sources change from month to month [1]. Without a fixed prompt set, you cannot tell whether a citation gain is real progress or random variation.

2. Engines source from substantially different pools. Chen et al. (September 2025) demonstrated empirically that AI search services differ significantly from each other in domain diversity, freshness, cross-language stability, and sensitivity to phrasing [2]. A brand cited 40 percent of the time in ChatGPT may appear in 8 percent of equivalent Perplexity answers. Single-engine measurement systematically misreads a Shopify store’s actual AI visibility.

3. Prompt category mix changes the answer. The original GEO paper (Aggarwal et al., KDD 2024) anchored its GEO-bench benchmark on roughly 10,000 queries split 80 percent informational, 10 percent transactional, 10 percent navigational — and showed empirically that GEO method effectiveness varies by domain and query type [3]. Chen et al. extended this for AI search: for consideration queries like “Garmin vs Apple Watch,” earned content dominates across all engines, but the mix differs — Google pairs earned content with substantial social signal, ChatGPT skews almost entirely to earned with minimal brand and social, and Perplexity maintains a more balanced spread [2]. A prompt set tilted only toward branded queries hides this — and hides where the brand is bleeding share.

For Shopify operators, these collapse into one practical conclusion: a 25-prompt set, run weekly across all five engines with a fixed category mix, is the smallest unit that produces directional measurement reliable enough to act on. A smaller set undersamples: natural variation drowns the signal. A larger set, at the 100-to-200-prompt scale common in B2B GEO programs [4], is unsustainable manually and adds operational overhead without changing decisions for a focused Shopify catalog. The leverage is in disciplined repetition, not test-set volume.


What separates a real prompt test set from a vibe check

Three properties distinguish prompt sets that drive optimization from prompt sets that produce theatre:

Locked structure, with quarterly refresh — not ad-hoc additions. Generic: someone adds a new prompt mid-month and the historical baseline breaks. AI-aware: the 25 prompts are fixed for at least a quarter; refresh additions happen on a calendar, with each new prompt tracked separately until it accumulates four weeks of history. Trending data only works if the denominator is constant. A continuously edited set measures nothing [4].

Real shopper language, sourced from owned data — not keyword tools. Generic: prompts derived from SEO keyword lists (“best running shoes flat feet”). AI-aware: prompts derived from support tickets, reviews, customer DMs, on-site search-bar logs, and PDP question fields [4] — the natural-language phrasing real shoppers use when they ask AI a buying question. Shoppers do not type three-word fragments into ChatGPT; they type 15-to-25-word situational questions. The prompt set must mirror that.

Category-balanced mix — not 25 variations of “best of.” Generic: 25 versions of “best [category] for [audience].” AI-aware: a structured mix across the funnel — informational, consideration, transactional, branded, and comparison. Chen et al.’s data shows engines source differently for each category [2]; a single-category set hides where the brand wins and where it is invisible.

A prompt set is a measurement instrument. Its value compounds with stability, breadth, and signal-fidelity to actual shopper behaviour — not with novelty.
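
To make the three properties concrete, here is a minimal sketch of how a locked, versioned prompt set could be represented, assuming it lives in a version-controlled file and is validated before each quarterly lock. The dataclasses, field names, and example prompt are illustrative assumptions rather than a prescribed schema; the point is that every prompt carries a category tag, an owned-data source attribution, and the date it entered the set.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Tuple

CATEGORIES = {"informational", "consideration", "transactional", "branded", "comparison"}

@dataclass(frozen=True)  # frozen: entries do not change inside a quarter
class Prompt:
    text: str       # full shopper-language question, typically 15-25 words
    category: str   # one of CATEGORIES
    source: str     # owned-data origin: support ticket, review, search log, PDP question field
    added: date     # when it entered the set; new prompts need four weeks of history before trending

@dataclass(frozen=True)
class PromptSet:
    version: str             # e.g. "2026-Q2"
    locked_until: date       # no mid-quarter edits before this date
    prompts: Tuple[Prompt, ...]

    def validate(self) -> None:
        """Enforce the structural rules: exactly 25 prompts, all five categories represented,
        roughly five per category (ecommerce sets may weight consideration and transactional)."""
        assert len(self.prompts) == 25, "set must hold exactly 25 prompts"
        counts = Counter(p.category for p in self.prompts)
        assert set(counts) == CATEGORIES, f"missing or unknown categories: {dict(counts)}"
        assert min(counts.values()) >= 3, f"category mix too lopsided: {dict(counts)}"

# One hypothetical entry, phrased the way a shopper would actually ask:
example = Prompt(
    text="I have flat feet and run 30km a week on pavement, which running shoes will hold up best?",
    category="consideration",
    source="on-site search log, March 2026",
    added=date(2026, 4, 1),
)
```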


The system

Cadence | Task | Difficulty | Note
--- | --- | --- | ---
Setup | Source the prompt library from owned data — support tickets, reviews, on-site search-bar logs, PDP question fields | 🟡 | First pass yields 60-100 candidates; cut to 25
Setup | Categorize and balance the set across informational, consideration, transactional, branded, and comparison | 🟡 | Roughly 5 prompts per category; weight toward consideration and transactional for ecommerce
Setup | Lock the test set in a versioned document — date-stamped, with category tags and source attribution per prompt | 🟢 | Version control is what makes historical comparison valid
Setup | Establish baseline — run the full set across all 5 engines and record citation, mention, and position before optimization work begins [5] | 🔴 | Without a baseline, no later result is interpretable
Weekly | Run the full 25-prompt set across all 5 engines | 🟡 | Manual run is 90-120 minutes; automated tracking platforms cover this once volume justifies it
Weekly | Log mention, citation, and position per prompt per engine in the tracking sheet | 🟢 | Three columns per engine; never aggregate engines in the raw data layer
Weekly | Flag any prompt where the brand’s signal moved by more than 20 percentage points week-over-week | 🟡 | Volatility flag separates real shifts from noise
Monthly | Compute share-of-model per engine, per category, and overall | 🟡 | The monthly view is what you act on; the weekly view catches anomalies
Monthly | Review competitor citation patterns — which 3-5 brands appear most in your set, on which engines | 🟡 | Identifies the brands AI engines treat as your category set
Quarterly | Refresh the set — retire 3-5 prompts that no longer reflect shopper language; add 3-5 from the most recent owned-data pull | 🟡 | Always retire and add in matched pairs; never net-grow
Quarterly | Realign category balance if the catalog mix has shifted — new SKU lines, retired categories | 🟢 | The set should track the business, not its founding catalog
Annual | Full prompt-set rebuild — re-source from scratch against current shopper language and current catalog [6] | 🔴 | Annual rebuild prevents drift into measuring what the brand used to be
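
The weekly raw layer and the volatility flag from the table above can live in a spreadsheet; as a minimal sketch in plain Python (no tracking platform assumed), they might look like this. The row shape, helper names, and the mention-rate definition of "signal" are illustrative assumptions, not the only valid reading of the 20-point rule.

```python
from dataclasses import dataclass
from typing import List, Optional

ENGINES = ["ChatGPT", "Perplexity", "Gemini", "Google AI Mode", "Microsoft Copilot"]

@dataclass
class WeeklyResult:
    week: str                # ISO week, e.g. "2026-W19"
    prompt_id: str           # stable id from the versioned prompt set
    engine: str              # one of ENGINES; never aggregated in the raw layer
    mentioned: bool          # brand named anywhere in the answer
    cited: bool              # brand domain appears as a linked source
    position: Optional[int]  # 1-based rank among cited sources, None if absent

def mention_rate(rows: List[WeeklyResult], prompt_id: str) -> float:
    """Share of engine runs that mentioned the brand for one prompt in one week."""
    hits = [r.mentioned for r in rows if r.prompt_id == prompt_id]
    return sum(hits) / len(hits) if hits else 0.0

def volatility_flags(this_week: List[WeeklyResult],
                     last_week: List[WeeklyResult],
                     threshold: float = 0.20) -> dict:
    """Flag prompts whose mention rate moved by more than 20 percentage points week-over-week."""
    prompt_ids = {r.prompt_id for r in this_week} | {r.prompt_id for r in last_week}
    return {
        pid: (mention_rate(last_week, pid), mention_rate(this_week, pid))
        for pid in prompt_ids
        if abs(mention_rate(this_week, pid) - mention_rate(last_week, pid)) > threshold
    }
```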

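For the monthly rollup, share-of-model per engine, per category, and overall can be computed from the same raw rows. The sketch below assumes share-of-model is measured as the share of prompt runs in which the brand is mentioned, which is one common definition among several; pandas and the column names are conveniences, not requirements.

```python
import pandas as pd

def monthly_share_of_model(rows: pd.DataFrame) -> dict:
    """rows: one row per prompt x engine x week for the month, with columns
    ['week', 'prompt_id', 'category', 'engine', 'mentioned', 'cited', 'position'].
    Returns mention-based share-of-model cuts; the raw layer stays untouched."""
    return {
        "per_engine": rows.groupby("engine")["mentioned"].mean(),      # the view you act on
        "per_category": rows.groupby("category")["mentioned"].mean(),  # where the brand wins or bleeds
        "overall": rows["mentioned"].mean(),                           # single headline number
    }
```
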
Get access to The Library → Implementation playbooks. July 2026. Earlier founders pay less; locked at #24.

Common gaps (8 out of 10 audits)

  • No test set at all. The most common gap. The store checks AI visibility once when somebody asks at a board meeting, then forgets for three months. By the time visibility is re-checked, the engines have moved, the catalog has changed, and there is no baseline to compare against.
  • ChatGPT-only testing. The store assumes the largest engine is the only engine that matters. Skips Perplexity, Gemini, AI Mode, Copilot. Misses Chen et al.’s engine-by-engine divergence — a brand citation-rich on ChatGPT can be almost invisible on Perplexity for the same category [2].
  • Prompts derived from keyword lists, not shopper language. Marketing pulls 25 high-intent keywords from an SEO tool and treats them as prompts. The set measures keyword visibility instead of shopper-question visibility — and the gap matters more on AI than it ever did on Google.
  • All prompts are “best [category]” variations. No informational prompts. No comparison prompts. No branded. Single category, single answer pattern, no view of where the brand wins or bleeds across the funnel.
  • Set edited mid-quarter. Someone adds a new prompt because a competitor launched a product. Trend lines reset. Three months of work becomes uninterpretable. The discipline of waiting until the quarterly refresh is what fails first.
  • Run once, never re-run. A consultant ran the set in December as part of a “GEO audit deliverable.” It has not been run since. The 40-to-60-percent monthly source churn documented across 2,500 tracked prompts [1] means a five-month-old snapshot is noise, not data.

Paid layer connection

The same prompt test set that measures organic AI visibility is the cleanest input for ChatGPT Ads targeting decisions (Ch. 25). Prompts where the brand has zero organic citation but high commercial intent are paid layer candidates — buy the placement until the earned media work (Ch. 14) catches up. Prompts where the brand already dominates organically are defensive bidding decisions — pay only if a competitor enters the auction and threatens displacement. The decision rule cannot be applied without the prompt set in place; with it, the same weekly run drives both organic optimization and the paid roadmap.
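
As a sketch of that routing logic in code: the 0.0 floor, the 0.5 "dominant" cutoff, and the intent/auction inputs are assumptions for illustration, not values fixed anywhere in this chapter.

```python
def paid_layer_action(organic_citation_rate: float,
                      commercial_intent: bool,
                      competitor_in_auction: bool) -> str:
    """Route one tracked prompt between the organic and paid layers.
    Thresholds are illustrative placeholders."""
    if organic_citation_rate == 0.0 and commercial_intent:
        return "paid candidate: buy the placement until earned media catches up"
    if organic_citation_rate >= 0.5:
        if competitor_in_auction:
            return "defensive bid: pay to hold the position"
        return "no spend: organic position holds"
    return "monitor: keep in the weekly run, not yet a paid decision"
```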


Deeper dive

Standalone posts will go further on:

  • The 25-prompt ecommerce sourcing methodology — extracting prompt candidates from support tickets, reviews, search logs, and PDP question fields, with category-balance worksheets
  • Manual cross-engine prompt-run workflow — the operational sequence for running 25 prompts × 5 engines weekly without paid tooling, before automation is justified

Subscribe → 4x weekly. Deep-dives ship here.


  1. Search Engine Land (February 16, 2026). Generative engine optimization (GEO): How to win AI mentions. Reports the Semrush AI Visibility Index analysis of 2,500 tracked prompts across Google AI Mode and ChatGPT showing 40-60% of cited sources change month-over-month. searchengineland.com/what-is-generative-engine-optimization-geo-444418
  2. Chen, M., Wang, X., Chen, K., & Koudas, N. (September 2025). Generative Engine Optimization: How to Dominate AI Search. arXiv:2509.08919. Documents engine-by-engine differences in domain diversity, freshness, cross-language stability, and sensitivity to phrasing; documents differential earned/brand/social source mix across query categories (informational, consideration, transactional).
  3. Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., & Deshpande, A. (2024). GEO: Generative Engine Optimization. Proceedings of the 30th ACM SIGKDD Conference (KDD '24). arXiv:2311.09735. Introduces GEO-bench (10,000 queries, 80% informational / 10% transactional / 10% navigational); demonstrates GEO effectiveness varies across domains and query types.
  4. Birkett, A. (March 6, 2026). How to Measure AI Share of Voice (+ 3 Tools). Documents prompt-sourcing methodology (sales calls, support tickets, win/loss interviews, on-site search), the 100-200-prompt range for B2B programs, and quarterly refresh discipline aligned to voice-of-customer cycles. alexbirkett.com/ai-share-of-voice
  5. Stackmatix (April 2026). Best GEO Tools Guide: AI Search Visibility Platforms in 2026. Documents the baseline-then-re-measurement discipline: record starting citation frequency and share of voice, implement changes, re-run tracked prompts every 30 days to measure improvement. stackmatix.com/blog/geo-tools-guide
  6. Ahrefs (February 16, 2026). 7 Steps for Tracking Your ChatGPT Visibility With Ahrefs. Documents a prompt category framework (branded searches, bottom-funnel terms, new product features), daily/weekly/monthly cadence options, and competitor benchmarking via tracked prompt sets. ahrefs.com/blog/chatgpt-visibility-tracking