Operating System · Last verified: MAY 2026

Chapter 21 — Three Metrics: Mention, Citation, Position

Definition

Three metrics describe a brand’s visibility in any AI answer engine: mention rate (does the brand name appear in the answer text), citation rate (does the brand’s domain appear as a source link), and position (when the brand is present, where does it appear — first cited, third in a list, footnote, sidebar block). The three are independent. They decouple severely in practice. A brand can be cited often and named rarely, named often and cited rarely, or present in the answer body but buried below the cut-off where most readers stop. A single composite “AI visibility score” hides which of these is happening — and the underlying optimization is different in each case.


Why it matters

The single most important measurement finding of the last twelve months is that mention and citation are not the same thing.

Kevin Indig’s April 2026 analysis of the Semrush AI Toolkit dataset — 3,981 domain appearances across 115 prompts, 14 countries, and 4 AI engines (ChatGPT, Google AI Overviews, Gemini, Google AI Mode) — produced the cleanest empirical breakdown to date [1]:

| Engine | Mention rate | Citation rate | Pattern |
| --- | --- | --- | --- |
| Gemini | 83.7% | 21.4% | Names brands; rarely links them |
| ChatGPT | 20.7% | 87.0% | Cites sources; rarely names brands |
| Google AI Mode | 37.6% | 76.3% | ChatGPT-like, slightly more naming |
| Google AI Overviews | 61% | 84.9% | Most balanced of the four |

Across the dataset, 61.7 percent of all citations were ghost citations — the brand’s domain was used as a source, but the brand name never appeared in the answer text. Only 13.2 percent of brand appearances generated both a citation and a mention [1]. A brand winning ChatGPT visibility is being treated as an academic footnote. A brand winning Gemini visibility is being named without being linked. These are not the same outcome, and they require different optimization work.
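To make the decoupling concrete, here is a minimal sketch in Python, assuming a hand-rolled log of per-prompt observations (field names are illustrative, not any tool’s schema). It derives mention rate, citation rate, ghost-citation share, and the mention-plus-citation overlap from the same yes/no flags:

```python
# Minimal sketch: deriving mention rate, citation rate, ghost-citation share,
# and mention+citation overlap from one per-prompt log. Field names are
# illustrative, not a vendor schema.

rows = [
    {"prompt": "best running watch for marathon training", "mentioned": True,  "cited": False},
    {"prompt": "garmin forerunner 965 vs coros pace 3",    "mentioned": True,  "cited": True},
    {"prompt": "how to choose a gps running watch",        "mentioned": False, "cited": True},
    {"prompt": "top running watches under $500",           "mentioned": False, "cited": False},
]

total = len(rows)
mention_rate  = sum(r["mentioned"] for r in rows) / total
citation_rate = sum(r["cited"] for r in rows) / total

citations   = [r for r in rows if r["cited"]]
ghost_share = sum(not r["mentioned"] for r in citations) / len(citations)  # cited, never named
both_rate   = sum(r["mentioned"] and r["cited"] for r in rows) / total     # named and linked

print(f"mention {mention_rate:.0%} | citation {citation_rate:.0%} | "
      f"ghost share {ghost_share:.0%} | both {both_rate:.0%}")
```

Run against a real weekly log, the same arithmetic reproduces the patterns in the table above: high citation with a high ghost share is the ChatGPT pattern; high mention with low citation is the Gemini pattern.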

Three structural facts follow from this:

1. Mention and citation drive different shopper behaviours. A named brand mention (“the Garmin Forerunner 965 is the best pick for marathon training”) is a recommendation. A ghost citation (a Garmin product page used as a source for a generic running-shoe answer that names other brands) is content extraction without endorsement. The first converts; the second feeds the engine [4]. Tracking only one metric misreads which is happening.

2. Engines that look similar at the brand level diverge at the metric level. Aggregating “AI visibility” across engines hides the divergence Indig documented [1]. A 40 percent share-of-voice number that blends ChatGPT mentions, Gemini citations, and AI Overviews coverage tells you nothing actionable. Per-engine, per-metric is the floor.

3. Position changes the value of a citation by an order of magnitude. Aggarwal et al. (2024), in the original GEO paper, did not treat visibility as a binary mention/no-mention metric. They introduced Position-Adjusted Word Count and Subjective Impression as two distinct visibility metrics — both weighted by where in the generated response a brand appears, on the empirical evidence that earlier and longer mentions leave a disproportionate impression [2]. Being the first brand recommended in a ChatGPT answer carries materially more weight than being the fifth — and “we got cited” reports treat both as equivalent.

For Shopify operators these collapse into one practical conclusion: tracking AI visibility as a single number is operationally useless. A three-metric framework — mention rate, citation rate, position — measured per prompt per engine against the test set from Ch. 20 is the smallest unit that produces optimization decisions interpretable enough to act on.
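A minimal sketch of that smallest unit, assuming a hand-rolled record rather than any specific tool’s schema: one row per prompt per engine, with the three metrics kept as separate fields instead of a composite score.

```python
# Minimal sketch of the three-metric unit: one record per prompt per engine.
# Field names and tier values are illustrative, not a vendor schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class VisibilityObservation:
    prompt: str              # exact prompt text from the Ch. 20 test set
    engine: str              # e.g. "chatgpt", "gemini", "ai_overviews", "ai_mode"
    mentioned: bool          # brand name appears in the answer text
    cited: bool              # brand domain appears as a source link
    position: Optional[int]  # 1 primary, 2 supporting, 3 citation-only, 4 footnote; None if absent

# A ghost citation logged against this structure: cited as a source, never named.
obs = VisibilityObservation(
    prompt="best running watch for marathon training",
    engine="chatgpt",
    mentioned=False,
    cited=True,
    position=3,
)
```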


What separates real three-metric tracking from a single-number dashboard

Three properties distinguish three-metric tracking from the aggregated single-number reporting most stores use:

Tracked separately — never collapsed into a composite. Generic: subscribe to a GEO tool that reports a single “AI Visibility Score” or “Share of Voice” number. AI-aware: track mention rate, citation rate, and position as three separate columns per prompt per engine. The composite-score pattern is convenient and increasingly common [5], but it discards exactly the information operators need: which metric is moving, on which engine, on which prompt category. The single number is the headline; the three sub-metrics are where the work happens.

Per-engine, never blended. Generic: report “AI visibility” averaged across ChatGPT, Perplexity, Gemini, AI Mode, and Copilot. AI-aware: report each engine separately and compare. Indig’s data shows ChatGPT and Gemini behave as near-inverses of each other on mention vs citation [1]. Chen et al. (2025) document engine-by-engine differences in “domain diversity, freshness, cross-language stability, and sensitivity to phrasing” [3]. A blended number is the average of behaviours that are structurally divergent. The blended number can rise while three of five engines decline.

Per-prompt drill-down — never just category averages. Generic: report aggregate citation rate by category (“informational: 24%, consideration: 18%, transactional: 11%”). AI-aware: keep the per-prompt grain accessible. The category average hides which specific prompts within the category are wins, losses, or volatility-only noise. The optimization work happens at the prompt level — fixing the three transactional prompts where citation is zero is different work from fixing the four where it’s 80 percent.

The principle across all three: a metric is only useful if it isolates the optimization decision it should drive. Aggregation hides decisions; disaggregation surfaces them.
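One way to honor that principle in practice is to compute category averages on demand from the per-prompt rows and never store the averages as the primary data. A small sketch, assuming rows shaped like the record above with an added category field:

```python
# Sketch: category averages derived on demand from per-prompt rows, so the
# per-prompt grain stays available for drill-down. Schema is illustrative.
from collections import defaultdict

def citation_rate_by_category(rows):
    """rows: dicts with 'category', 'prompt', and 'cited' keys."""
    by_category = defaultdict(list)
    for row in rows:
        by_category[row["category"]].append(row)
    return {
        category: {
            "citation_rate": sum(r["cited"] for r in group) / len(group),
            # the list a category average hides: the exact prompts with zero citations
            "zero_citation_prompts": [r["prompt"] for r in group if not r["cited"]],
        }
        for category, group in by_category.items()
    }
```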


How position is measured

Position is the fuzziest of the three metrics. The other two are binary per prompt — was the brand mentioned (yes/no), was it cited (yes/no). Position is ordinal and varies by engine surface.

The simplest workable position scheme — adequate for $100k+/mo Shopify stores running manual or lightly tooled tracking — is a 4-tier categorization per prompt:

| Tier | Position type | What it looks like |
| --- | --- | --- |
| 1 | Primary | Named first in the answer, or the answer’s headline recommendation |
| 2 | Supporting | Named in the answer body but not as the lead recommendation |
| 3 | Citation-only (ghost) | Domain cited as a source, brand name absent from the answer text [1] |
| 4 | Footnote/sidebar | Listed only in the citation block or expandable sources panel, not in the visible answer |

The 4-tier scheme is enough to detect the patterns that drive optimization decisions. Stores running on dedicated GEO platforms can move to Aggarwal et al.’s Position-Adjusted Word Count or to vendor-specific composite position scores [2][5]; for most operators the 4-tier categorization captures the actionable signal without the methodology overhead.
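For stores that do graduate to weighted scoring, the idea behind a position-adjusted measure can be sketched in a few lines. This is a deliberately simplified illustration of the earlier-and-longer-counts-more principle, not the exact Position-Adjusted Word Count formulation from Aggarwal et al. [2].

```python
# Simplified position-weighted visibility sketch. Not the exact Position-
# Adjusted Word Count formulation from Aggarwal et al. (2024); it only
# illustrates that earlier and longer brand passages should count for more.

def position_weighted_words(answer_sentences, brand_sentence_indexes):
    """answer_sentences: sentences of the AI answer, in order.
    brand_sentence_indexes: indexes of sentences attributed to the brand."""
    n = len(answer_sentences)
    score = 0.0
    for i in brand_sentence_indexes:
        words = len(answer_sentences[i].split())
        weight = (n - i) / n  # first sentence weighs 1.0, last weighs 1/n
        score += words * weight
    return score
```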


The system

| Cadence | Task | Difficulty | Note |
| --- | --- | --- | --- |
| Setup | Write down metric definitions: what counts as a mention, what counts as a citation, what counts as each position tier | 🟡 | Definitional drift is the silent killer of multi-quarter trend data |
| Setup | Build the tracking sheet: three columns per engine (mention Y/N, citation Y/N, position 1-4), one row for each of the 25 prompts | 🟢 | One sheet, five tabs (one per engine), one row per prompt |
| Setup | Establish the baseline: first weekly run across all 5 engines and all 25 prompts before any optimization work [6] | 🔴 | The baseline is what every later number gets compared against |
| Weekly | Run the test set; log mention, citation, and position per prompt per engine | 🟡 | 90-120 min manually; faster with automation once volume justifies it |
| Weekly | Explicitly tag every ghost citation (cited without a mention) and every mention without a citation | 🟢 | These are the two patterns that single-metric tracking misses |
| Weekly | Flag week-over-week movement of more than 20 percentage points on any single metric | 🟡 | The threshold separates a real shift from non-determinism noise |
| Monthly | Compute per-engine, per-metric, per-category aggregates from the raw data | 🟡 | Aggregation happens at the month boundary, not in the raw data layer |
| Monthly | Rank competitor citation rate, mention rate, and primary-position rate against your brand’s | 🟡 | Identifies which 3-5 brands AI engines treat as your competitive set per metric |
| Monthly | Flag every prompt where the ghost-citation rate exceeds 50% | 🔴 | A high ghost rate signals content feeding the engine without earning the recommendation; fixable, but only if surfaced |
| Quarterly | Recalibrate metric definitions if engine behavior has shifted (new answer-format roll-outs, new sidebar UI, etc.) | 🟡 | Engines change UI; what counted as “primary” three months ago may now be “supporting” |
| Quarterly | Cross-reference position movement against earned-media additions (Ch. 14) and content refreshes (Ch. 23) | 🔴 | Tests the causal hypothesis: did the input work move the output metric? |
| Annual | Full metric framework review against the current engine landscape | 🔴 | New engines launch; old engines change citation and mention behavior; the framework adapts |
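The two weekly flags in the table translate directly into code. A sketch, assuming the same illustrative row schema as above; the 20-point and 50-percent thresholds are the ones named in the table, not universal constants.

```python
# Sketch of the two weekly flags from the cadence table. Row and rate shapes
# are illustrative; the thresholds come from the table, not from a standard.
from collections import defaultdict

MOVEMENT_THRESHOLD = 0.20  # week-over-week shift that exceeds non-determinism noise
GHOST_THRESHOLD = 0.50     # ghost-citation share that signals a content-architecture problem

def flag_movement(prev_rates, curr_rates):
    """prev_rates/curr_rates: {(engine, metric): rate} computed over the same prompt set."""
    return {
        key: (prev_rates[key], rate)
        for key, rate in curr_rates.items()
        if key in prev_rates and abs(rate - prev_rates[key]) > MOVEMENT_THRESHOLD
    }

def flag_ghost_prompts(rows):
    """rows: per-prompt dicts with 'prompt', 'cited', 'mentioned' flags from the weekly runs."""
    cited, ghost = defaultdict(int), defaultdict(int)
    for r in rows:
        if r["cited"]:
            cited[r["prompt"]] += 1
            ghost[r["prompt"]] += not r["mentioned"]
    return [p for p in cited if ghost[p] / cited[p] > GHOST_THRESHOLD]
```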

Common gaps (8 out of 10 audits)

  • Single-metric reporting only. The store tracks “AI mentions” or “AI citations” but not both, and never position. Misses the ghost-citation pattern entirely — a 60 percent citation rate looks healthy until you discover that 80 percent of those citations are ghost citations.
  • Aggregated across engines. Dashboard shows “AI visibility: 32%” with no per-engine breakdown. Loses the structural fact that ChatGPT and Gemini are near-inverses of each other on mention vs citation behavior [1]. Cannot diagnose which engine is moving.
  • No definition of what counts as a mention. Brand mention vs product mention vs domain reference treated interchangeably. Trend data drifts as the unstated definition shifts week over week.
  • Position not tracked at all. Whether the brand is “in the answer” is recorded as a binary yes/no. A top-billed primary recommendation and a footnote three tiers down are recorded the same way. The signal that matters most for actual conversion is invisible.
  • No competitor benchmark per metric. The store knows its own citation rate is 18 percent; doesn’t know whether the category leader is at 45 percent or 22 percent. Without the comparison, the absolute number is uninterpretable.
  • No per-prompt drilldown. Reporting stops at category averages. Cannot see which specific prompts in the consideration category sit at zero citation and which at 100 percent. Optimization remains generic.

Paid layer connection

The three metrics also diagnose paid-layer opportunity differently. Zero-citation, zero-mention prompts are paid-layer candidates — buy the placement on ChatGPT Ads (Ch. 25) until earned-media work catches up. High-citation, zero-mention prompts (the ghost-citation pattern) are content-architecture problems, not paid problems — the AI is reading the brand’s pages and not endorsing them; paid spend won’t fix this until the underlying PDP and policy schema work signals to the AI that the brand is a recommendation, not a knowledge source. High-mention, low-position prompts are competitive defense — the brand is in the conversation but not winning it; consider both content depth (Ch. 11) and selective paid bidding to displace the primary recommendation. Single-metric tracking cannot make any of these distinctions; the three-metric framework makes them obvious.
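A compact way to encode that triage, assuming per-prompt rates computed from the weekly log; the thresholds are illustrative judgment calls rather than fixed rules from any of the cited sources.

```python
# Sketch of the paid-layer triage described above. Thresholds are illustrative
# judgment calls, not fixed rules from the cited sources.

def paid_layer_triage(mention_rate, citation_rate, primary_position_rate):
    if mention_rate == 0 and citation_rate == 0:
        return "paid candidate: buy placement (Ch. 25) until earned media catches up"
    if citation_rate >= 0.5 and mention_rate <= 0.1:
        return "ghost-citation pattern: fix PDP and policy schema, not paid spend"
    if mention_rate >= 0.5 and primary_position_rate <= 0.2:
        return "competitive defense: content depth (Ch. 11) plus selective paid bidding"
    return "no clear paid-layer action from these three metrics alone"
```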


Deeper dive

Standalone posts will go further on:

  • The ghost citation diagnostic playbook — engine-specific patterns, when ghost citation indicates content extractability working as intended vs failing, and the content-architecture moves that convert ghost citations into mentions
  • Position scoring methodology — comparing the 4-tier manual approach against Aggarwal et al.’s Position-Adjusted Word Count for stores ready to graduate to weighted scoring

Subscribe → 4x weekly. Deep-dives ship here.


  1. Indig, K. (April 2026). The Ghost Citation Problem. Growth Memo. growth-memo.com/p/the-ghost-citation-problem. Documents the dataset of 3,981 domain appearances across 115 prompts, 14 countries, and 4 AI engines (ChatGPT, Google AI Overviews, Gemini, Google AI Mode), with engine-by-engine mention vs citation rates: Gemini 83.7%/21.4%, ChatGPT 20.7%/87.0%, AI Mode 37.6%/76.3%, AI Overviews 61%/84.9%; 61.7% ghost-citation share; 13.2% mention+citation conversion rate. Full reference →
  2. Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., & Deshpande, A. (2024). GEO: Generative Engine Optimization. Proceedings of the 30th ACM SIGKDD Conference (KDD ‘24). arXiv:2311.09735. Introduces Position-Adjusted Word Count and Subjective Impression as foundational visibility metrics; demonstrates that GEO method effectiveness varies across query types and domains. Full reference →
  3. Chen, M., Wang, X., Chen, K., & Koudas, N. (September 2025). Generative Engine Optimization: How to Dominate AI Search. arXiv:2509.08919. Documents engine-by-engine differences in domain diversity, freshness, cross-language stability, and sensitivity to phrasing — empirical justification for per-engine measurement. Full reference →
  4. TheRankMasters (February 3, 2026). How We Win AI Answers Visibility (mentions/citations) in 2026. Documents the three-metric KPI stack (visibility, sentiment, citation as separate realities) and the framing of citations as a compounding asset distinct from brand mentions. therankmasters.com/insights/ai-visibility/get-cited-ai-answers. Full reference →
  5. Digital Applied (April 27, 2026). AI Search Visibility Score: A Proprietary Metric Spec. Documents the explicit limitation that citation rate alone misses position effects, answer share alone misses persistence, and position-only metrics miss the universe of uncited prompts; proposes composite metric construction with empirical regression-derived weighting (0.35/0.25/0.25/0.15) across four sub-metrics. digitalapplied.com/blog/ai-search-visibility-score-proprietary-metric-spec. Full reference →
  6. Discovered Labs (January 9, 2026). GEO Metrics: What KPIs Matter & How to Track Them (2026). Documents the Citation Rate calculation formula (Your citations / Total citations across all brands tested) × 100, baseline benchmarks (8-15% minimal, 20-30% gaining traction, 40-50%+ strong category visibility), and the discipline of competitive benchmarking against the same prompt test set. discoveredlabs.com/blog/geo-metrics-what-kpis-matter-how-to-track-them-2026. Full reference →