For technical SEO leads, AEO practitioners, and content engineers
How AI Models Decide Which Brands to Cite
AI answer engines do not rank sources the way Google does. They run a multi-stage pipeline of retrieval, reranking, and policy filtering. Influence on citation choice comes from the structural and entity-layer signals fed into each stage.
By Ali Jakvani, Cofounder
Most teams treat AI citation as a black box. It is not. Modern answer engines, regardless of vendor, follow the same coarse architecture, and knowing which stage you are losing at is the entire diagnostic.
The high-level pipeline
- Query reformulation. A single user question is decomposed into 3 to 10 retrieval queries.
- Candidate retrieval. Each sub-query hits a vector index, a keyword (BM25) index, or a live web search API. Top N candidates are pulled, often N = 50 to 200.
- Reranking. A smaller cross-encoder model scores each candidate against the query. The top K candidates (often 5 to 20) survive.
- Policy filtering. Source diversity, blocklists, freshness windows, and trust filters trim the survivors.
- Generation. The LLM composes an answer using the surviving passages as context.
- Citation selection. Some engines cite every retrieved source; others cite only the passages whose content was materially used. (The whole pipeline is sketched in code below.)
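In code, the whole thing is short. The sketch below is a mental model, not any vendor's implementation: every stage is passed in as a callable precisely because each one is a proprietary component.

```python
# Sketch of the pipeline above. Every callable is injected because each
# stage is proprietary; nothing here is a specific vendor's code.
def answer_pipeline(question, reformulate, retrievers, rerank, apply_policies,
                    generate, select_citations):
    # 1. Query reformulation: one question becomes several retrieval queries.
    sub_queries = reformulate(question)

    # 2. Candidate retrieval: union of vector, BM25, and live web hits.
    candidates = []
    for query in sub_queries:
        for retrieve in retrievers:
            candidates.extend(retrieve(query, top_n=100))

    # 3. Reranking: a cross-encoder keeps the top K passages.
    survivors = rerank(question, candidates, top_k=10)

    # 4. Policy filtering: diversity, blocklists, freshness, trust.
    survivors = apply_policies(survivors)

    # 5. Generation: the LLM writes the answer from the surviving passages.
    answer = generate(question, context=survivors)

    # 6. Citation selection: keep only the sources whose content was used.
    return answer, select_citations(answer, survivors)
```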
What the retrieval layer rewards
Dense (vector) retrieval
Vector retrieval embeds the query and candidate passages into the same vector space and returns nearest neighbors. Semantic match beats keyword match. Most production systems chunk pages into 200 to 1,000 token segments. A page with one wall of text becomes one or two clumsy chunks. A page with discrete H2 sections and clean paragraph breaks becomes many cleanly retrievable chunks.
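A minimal sketch of that step, assuming the sentence-transformers package and a public embedding model; production engines use their own models and approximate nearest-neighbor indices rather than the brute-force scoring shown here:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # public stand-in model

# In a real pipeline each chunk is a 200-1,000 token section of a page,
# which is why clean H2 sections become cleanly retrievable chunks.
chunks = [
    "Dense retrieval embeds queries and documents into the same vector space.",
    "Our company was founded in 2014 and is headquartered in Austin.",
    "BM25 ranks documents by exact-term overlap with the query.",
]
query = "what is dense retrieval"

# Embed query and chunks; unit-normalize so the dot product is cosine similarity.
chunk_vecs = model.encode(chunks, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)[0]

# Nearest neighbors of the query become the retrieval candidates.
scores = chunk_vecs @ query_vec
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {chunks[idx]}")
```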
Keyword (BM25) retrieval
BM25 still gets heavy use, especially for proper nouns, product names, and rare terms. Exact-match phrases survive. Pages that paraphrase aggressively away from query terminology lose ground in pure BM25. Most engines run hybrid retrieval and re-weight via the reranker.
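A minimal BM25 sketch, assuming the rank_bm25 package as a stand-in for an engine's own keyword index. The passage that keeps the exact phrase scores; the heavy paraphrase does not:

```python
from rank_bm25 import BM25Okapi

docs = [
    "Dense retrieval embeds queries and documents into one vector space.",
    "Representing text numerically has changed how we find information.",
]
tokenized = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized)

query_tokens = "dense retrieval".lower().split()
scores = bm25.get_scores(query_tokens)
print(scores)           # the exact-match document scores; the paraphrase gets zero
print(scores.argmax())  # index 0 wins

# Hybrid systems combine this with the dense score and let the reranker re-weight.
```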
Live web search
Some engines (Perplexity, ChatGPT browsing, Google AI Overviews) augment retrieval with live web search. Pages that rank well in Google or Bing are more likely to be pulled in. Crawlability for AI agents specifically also matters: if GPTBot or PerplexityBot is blocked, those engines will not retrieve the page even if it ranks.
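Checking the crawlability half of that is a one-off script. This sketch uses only the standard library; the user-agent strings are the published crawler names, and example.com is a placeholder:

```python
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]

def check_ai_crawl_access(site: str, path: str = "/") -> dict:
    """Return {agent: allowed} based on the site's robots.txt."""
    rp = RobotFileParser()
    rp.set_url(f"{site.rstrip('/')}/robots.txt")
    rp.read()
    return {agent: rp.can_fetch(agent, f"{site.rstrip('/')}{path}") for agent in AI_AGENTS}

print(check_ai_crawl_access("https://example.com"))
```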
What the reranker rewards
The reranker is the stage where most AEO-quality investments pay off. Cross-encoder rerankers are more expensive but more accurate than embedding-only retrieval. Observed preferences across systems (a scoring sketch follows the list):
- Direct relevance. Passages that explicitly answer the sub-query outscore passages that contextually relate.
- Self-contained passages. Chunks that can be understood without surrounding context outscore chunks that depend on prior paragraphs.
- Claim-first structure. Lead with the claim, then justify it.
- Definitional clarity. "X is Y" statements get strong scores for definitional queries.
- Structural cleanliness. Passages without nav fragments, cookie notices, or boilerplate intrusions score better.
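Here is what that scoring looks like in practice, using a public cross-encoder from sentence-transformers as a stand-in for whatever proprietary reranker an engine actually runs. The first passage is claim-first and self-contained; the second leans on missing context:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # public stand-in model

query = "what is dense retrieval"
passages = [
    # Claim-first, self-contained, definitional:
    "Dense retrieval is a technique that embeds queries and documents into the "
    "same vector space and returns nearest neighbors as candidate documents.",
    # Related but indirect and dependent on prior context:
    "As discussed above, the field has moved through several eras, and vector "
    "spaces have changed things again.",
]

# The cross-encoder scores each (query, passage) pair jointly; higher = more relevant.
scores = reranker.predict([(query, p) for p in passages])
for score, passage in sorted(zip(scores, passages), key=lambda t: t[0], reverse=True):
    print(f"{score:.2f}  {passage[:60]}...")
```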
What the policy and citation layer rewards
Source authority
Authority here is multidimensional: domain reputation, editorial signals (named authors with credentials, dates, primary references), citation graph (whether other authoritative sources cite the page), and EEAT alignment.
Source diversity
Citation policies typically penalize over-reliance on a single domain. If three of the top retrieved passages are from the same source, the system often picks one and replaces the others with passages from different domains. Domain saturation is self-defeating.
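A diversity policy of this kind reduces to a per-domain cap over the reranked list. A sketch, with an illustrative cap of one passage per domain:

```python
from collections import Counter
from urllib.parse import urlparse

def apply_domain_cap(ranked_passages, max_per_domain=1):
    """ranked_passages: list of (score, url, text), best first.
    Keeps at most max_per_domain passages from any one domain."""
    kept, seen = [], Counter()
    for score, url, text in ranked_passages:
        domain = urlparse(url).netloc
        if seen[domain] < max_per_domain:
            kept.append((score, url, text))
            seen[domain] += 1
    return kept
```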
Freshness
For time-sensitive queries, retrieval and rerank stages bias toward recent content. Sitemap lastmod that reflects actual changes helps. Visible "Last updated" dates help. Stale signals like outdated examples, obsolete product names, or expired tooling references hurt.
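Nobody outside the vendors knows the exact weighting, but a useful mental model for a recency bias is a decay multiplier on the relevance score. The half-life below is purely illustrative:

```python
import math

def freshness_weighted(score: float, age_days: float, half_life_days: float = 180) -> float:
    """Illustrative only: relevance score decays with content age."""
    return score * math.exp(-math.log(2) * age_days / half_life_days)

print(freshness_weighted(0.9, age_days=30))   # recently updated page keeps most of its score
print(freshness_weighted(0.9, age_days=900))  # stale page loses most of it
```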
Citation friendliness
Quotable claims, numbered or labeled facts, definition blocks, and table cells are all easy for the generator to cite. Narrative arguments where the claim is distributed across paragraphs are hard to cite even when the reasoning is sound.
A model of citation probability
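The pipeline gives you the model almost for free: a page is cited only if it survives every stage, so citation probability is roughly the product of stage-wise survival probabilities. This is an idealization that assumes the stages are independent, which they are not, but it makes the diagnostic point from the top of this piece concrete: the weakest stage dominates, so fix that one first.

```python
# Idealized: P(cited) ~ P(retrieved) * P(survives rerank) * P(passes policy) * P(used)
def citation_probability(p_retrieved, p_rerank, p_policy, p_used):
    return p_retrieved * p_rerank * p_policy * p_used

# A page that retrieves well but reranks poorly is still a long shot.
print(citation_probability(0.8, 0.2, 0.9, 0.7))  # 0.1008
```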
Brand-specific signals
Entity coherence
Models maintain implicit entity graphs. A brand that is consistently named, consistently described, and consistently linked across the web becomes a stable node. A brand whose name shifts between three variants becomes a fuzzy node that models hedge against.
- One canonical brand name across all surfaces.
- Consistent category language used the same way across owned and earned media.
- sameAs links from your Organization schema to authoritative external profiles (see the sketch after this list).
- Consistent founder and key-person bios across owned and external profiles.
- Consistent product naming across all docs.
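The sameAs item above looks like this in practice. The names and URLs below are placeholders; the structure is standard schema.org Organization markup, generated here in Python to stay consistent with the other examples but shipped as a JSON-LD script tag:

```python
import json

# Placeholder organization; swap in your canonical name, URL, and profiles.
organization_schema = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Co",              # the one canonical brand name
    "url": "https://www.example.com",
    "sameAs": [
        "https://www.linkedin.com/company/example-co",
        "https://en.wikipedia.org/wiki/Example_Co",
        "https://www.crunchbase.com/organization/example-co",
    ],
}

# Embed the output in a <script type="application/ld+json"> tag on every page.
print(json.dumps(organization_schema, indent=2))
```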
Citation graph position
A brand cited frequently by other authoritative sources accumulates a citation prior. This is different from backlink count. The relevant graph includes industry analyses, podcast and conference references with named association, Wikipedia presence (where notability is met), trade publication mentions, and tool or product database listings.
Citation-friendly vs citation-hostile structures
Citation-friendly
"Dense retrieval is a technique that embeds queries and documents into the same vector space using a neural model, then returns the nearest neighbors of the query as candidate documents. It is typically combined with keyword retrieval (BM25) in production systems to balance semantic and exact-match relevance." A reranker scoring this for "what is dense retrieval" gives high marks: claim-first, self-contained, definitional, quotable.
Citation-hostile
"We have spent a lot of time thinking about how information retrieval has evolved. From the early days of inverted indices through the rise of large-scale search engines, the field has been on a long arc. Vector spaces, of course, have changed things again." Same topic, no extractable claim. A reranker downscores it.
What you actually do about it
- Audit your retrieval surface. Are you crawlable by GPTBot, ClaudeBot, PerplexityBot, Google-Extended? Is render parity intact for non-JS agents?
- Engineer for the reranker. Rewrite priority pages with question-led H2s and direct answers in the first 60 words of each section.
- Stabilize your entity graph. Audit naming, lock in canonical strings, build sameAs links, ensure consistency across owned and external surfaces.
- Monitor citations. Probe target engines on a defined prompt panel and watch citation share over time.
Frequently asked questions
Do AI engines look at backlinks?
Yes, indirectly. Backlinks influence the credibility prior at the policy stage and contribute to the broader citation graph. They are no longer the dominant signal but they are not absent.
How does Perplexity decide what to cite?
Perplexity uses hybrid retrieval (live web search plus its own indices), reranks candidates, applies source diversity policies, and tends to cite passages whose content is materially used in the answer. Direct, claim-first passages with clean structure are preferred.
How does ChatGPT decide what to cite when browsing?
ChatGPT's browsing pipeline retrieves candidate URLs via search, fetches and chunks them, reranks for relevance, and includes citations for the sources used to compose the answer. Crawlable HTML, render parity, and extractable passages all influence selection.
What is the most underrated AEO signal?
Render parity. Many sites serve different HTML to bots than to browsers, often unintentionally because of JavaScript rendering. AI agents that do not execute JS see a stripped page and downscore it.
Does freshness apply to evergreen topics?
Less aggressively, but yes. A page on a stable topic that has not been updated in years still tends to lose to an equivalent page updated last month, because freshness remains one input to the score even when other signals dominate.
How do I track which engines cite my brand?
You probe each engine on a defined prompt panel and record cited domains. Multi-engine citation monitoring tools automate this; an in-house version is a scripted prompt loop plus a database.
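A skeleton of that in-house loop, using only the standard library. The query_engine callable is left as a stub because every engine has its own API and terms of use:

```python
import datetime
import sqlite3
from urllib.parse import urlparse

# A fixed prompt panel; replace with the queries your buyers actually ask.
PROMPT_PANEL = [
    "what is dense retrieval",
    "best tools for AI answer engine optimization",
]

def record_citations(query_engine, engines, db_path="citations.db"):
    """query_engine(engine, prompt) is a stub that must return a list of cited URLs."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citations (day TEXT, engine TEXT, prompt TEXT, domain TEXT)"
    )
    today = datetime.date.today().isoformat()
    for engine in engines:
        for prompt in PROMPT_PANEL:
            for url in query_engine(engine, prompt):
                conn.execute(
                    "INSERT INTO citations VALUES (?, ?, ?, ?)",
                    (today, engine, prompt, urlparse(url).netloc),
                )
    conn.commit()
    conn.close()
```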
References
- [1] Foundational IR literature on BM25 and probabilistic retrieval.
- [2] Karpukhin et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. arXiv.
- [3] Nogueira & Cho (2019). Passage Re-ranking with BERT. arXiv.
- [4] Schema.org — Organization and sameAs reference.
- [5] OpenAI — GPTBot and OAI-SearchBot documentation.
- [6] Anthropic — Claude web access and ClaudeBot documentation.
- [7] Perplexity — citation-first answer engine.
- [8] Google — official posts on AI Overviews.
- [9] Kumar & Palkhouski (2025). AI Answer Engine Citation Behavior. arXiv.
Want to see how your brand shows up in AI answers?
Run a free AI-Readiness scan. Get a 13-factor score and a live response from ChatGPT, Claude, Perplexity, and Gemini. No signup required.