How to build a weekly citation-rate audit across ChatGPT, Perplexity, Gemini, and AI Overviews

A concrete architecture for running a weekly automated citation-rate audit across the four AI surfaces that matter. Tools, prompt schema, database structure, and the build checklist — designed to be copied. Not a SaaS pitch, an engineering pattern.

Elizabeth S.

Founder 25 May 2026 6 min read

Summarize with AI Open this article in your preferred assistant

In this article

01 What the automation does
02 The stack, named explicitly
03 Three design choices that matter more than they look
04 The 50 prompts: how to pick them
05 What the digest looks like
06 What the data unblocks
07 Build effort
08 What to do this week if you want this

A weekly citation-rate audit is one of those operational rituals that produces almost no insight on most weeks and the occasional decisive signal on the weeks that matter. Done by hand it eats two to three hours of someone’s Monday and quietly stops happening when the team gets busy. Done as an automation it costs an afternoon to build and runs forever.

Why bother at all? Two 2026 studies put a number on the value of being cited. Seer Interactive’s November 2025 research (42 brands, 25M+ impressions) found that organic CTR runs 35% higher when your brand is cited inside the AIO vs an AIO that does not cite you. And BrightEdge’s 2026 industry tracker found that in finance, only ~11% of AIO citations come from organic top-10 pages — meaning ranking and being cited are increasingly different problems. Citation rate is the metric that catches a leak ranking trackers can’t see.

This article is the architecture. Not a SaaS pitch, not a tutorial — a pattern you can copy, adapt, and own.

What the automation does

Once a week, on a fixed schedule, the system reads a curated set of 50 prompts from a sheet. For each prompt, it runs four parallel API calls — ChatGPT, Claude, Perplexity, and Google AI Overviews (via SerpAPI). For each response it checks whether the tracked brand is named, whether named competitors are named, and what domains are cited.

It writes 200 rows to a database (50 prompts × 4 surfaces), computes a week-over-week delta against the previous run, rolls up a short digest, and posts it to Slack or email.

By Monday morning the work is done and the question worth asking is “did anything move?” — not “what is the number?”.

The stack, named explicitly

Orchestrator: Make.com or n8n. Either works. Make.com is faster to ship if you prefer visual scenarios; n8n is faster if you want self-hosted control. Anything that supports iterators, aggregators, scheduled triggers, and HTTP modules will do.
Models: the public APIs. OpenAI for the ChatGPT surface. Anthropic for the Claude surface. Perplexity sonar models (sonar-pro recommended) — Perplexity exposes a citations field natively, which is the cleanest source for cited-domains data. SerpAPI for the Google AI Overview surface — it is the most reliable way I have found to parse AIOs programmatically along with the cited sources.
Storage: Notion, Airtable, or Postgres. Notion is friction-free up to a few thousand rows. Migrate to Postgres or Airtable once volume is sustained.
Prompt registry: Google Sheets or Notion. Keep it editable. The friction of a human editing rows is what keeps the prompt set honest.
Human surface: Slack or email. Six lines, no charts. Detail lives in the database, linked from the digest.

Three design choices that matter more than they look

These are the ones that bite teams who try to build this from a description.

The row schema is per (prompt × surface × week), not per prompt. It is tempting to make each prompt a row with four columns (one per surface). It is unusable. You cannot compute “Perplexity citation rate this week vs last” with that schema without writing custom joins. The expanded schema makes every interesting query a one-line filter.

Cap parallelism at 6–8, not unbounded. Make.com (and most orchestrators) will happily fire 200 parallel requests. Perplexity’s rate limit will start rejecting them and your numbers go non-deterministic. The cap costs you a few minutes of total runtime and saves a morning of debugging.

The source-listing pass runs as a separate API call. Asking a model “answer this and list your sources” measurably degrades the answer quality — the model is optimizing two outputs at once. Better to ask the real question first, then send a second turn asking the model to retroactively list the sources it would have cited. Imperfect, still useful for directional data.

The 50 prompts: how to pick them

The prompt set is the entire point of the automation. The automation runs every week; the prompts should be revised quarterly. A workable procedure:

Pull the questions buyers have asked the sales pipeline in the last 90 days. If your CRM supports tagging, tag anything that starts with “how”, “what”, “which”, “is X better than”.
Pull the questions the customer-support chat widget has logged in the last 90 days. Intercom, Crisp, and similar tools all export.
Filter to questions that are answerable in a public AI surface. Strip questions that depend on private knowledge (“does your product integrate with my internal CRM”).
Group by category and pick a handful per category. Aim for around 50 total — small enough to read by hand, large enough to smooth weekly noise.

No keyword tool. No search-volume filter. The prompts are buyer-language, not search-query-language. The two diverge more than most marketers admit.

What the digest looks like

The digest is short on purpose. Six lines, designed to be readable in the four seconds between opening Slack and deciding whether to open the database.

🟢 Brand citation rate: 31% (+2 pp vs last week)
🟢 ChatGPT: 38% (+4 pp)  ·  Perplexity: 41% (+2 pp)
🟡 Gemini: 24% (−1 pp)   ·  AI Overview: 22% (+3 pp)
🔴 Lost cites: 3 prompts  (open thread for list)
🟢 Gained cites: 7 prompts
↘  Top competing domain this week: example.com

Numbers and colors matter for peripheral-vision triage. The “open thread for list” link points to a filtered database view — the work the team actually does lives there.

What the data unblocks

The point of the automation is not the report. It is what the report unblocks.

Lost cites this week → look at the prompts. If the answer changed because a competitor published a new page, that’s a content gap. If the answer changed because the model updated its priors, that’s a structural problem with entity reinforcement.

Gained cites this week → check what shipped two to four weeks ago that might be responsible. The lag between publishing a citation-friendly page and seeing it surface in answers is real, though it varies by surface and language. Tracking the lag for your own program over time is far more useful than relying on a published benchmark.

Surface divergence → if ChatGPT citation rate moves up while Gemini moves down on the same prompts, something is shifting at the model layer. Worth a note, not always worth acting on immediately.

Most weeks the digest produces no action. That is correct. The point is to detect when something has moved, not to manufacture work. The automation pays for itself the few weeks a year when a real change in the data lets the team act in seven days rather than seven weeks.

Build effort

Plan a focused afternoon for v1 if you are comfortable with no-code orchestrators. Add a second afternoon for v2 once you have lived with v1 for two or three weeks and know what you want to change. The biggest hidden cost is not the build — it is the discipline of curating the prompt set and refreshing it quarterly.

What to do this week if you want this

Three options, in order of effort.

DIY. The checklist on this article is enough to rebuild from scratch. Don’t skip the parallelism cap.
Hybrid. Run the DIY build for internal signal, complement it with a SaaS tool like Profound or Peec when you need defensible numbers for an external report.
Done-for-you. We build versions of this for clients as part of a GEO engagement. The footer contact link is the right entry point if you want to scope it.

The automation is the cheapest part of operating a serious GEO program. The expensive part is the discipline to look at the digest every week and act on the few signals worth acting on. The build below is mostly a way to make the discipline cheaper.

The architecture, in order

From scheduled trigger to Slack digest

Scheduled trigger fires

Pick a quiet hour (e.g. early Monday morning). The scenario runs unattended.
Read prompt sheet

Pull the 50 prompts, each tagged with the brand to track and 2–4 competitor names.
Fan out to four surfaces

50 prompts × 4 surfaces = 200 API calls. Cap concurrency at 6–8 to respect Perplexity's rate limit; the slowest surface dictates total runtime.
Parse + match

Each response is checked for brand mention, competitor mention, and cited domains. For surfaces that don't expose citations natively, use a JSON-mode follow-up call to ask the model to list sources.
Write to database

One row per (prompt × surface × week). Set the ISO-week field so weekly diffs are a simple filter, not a join.
Compute week-over-week delta

Look up the matching row from the previous ISO week. Mark each cell gained / lost / stable.
Aggregate + post digest

Roll up citation rate by surface, list the top gained and lost prompts, and post a short message to Slack with a link back to the database view for detail.

Why automate this at all

“Doing the audit manually is workable for one or two weeks. The reason to automate is that consistency is what makes week-over-week deltas mean anything. A measurement done four different ways across four different Mondays is noise.”

— Elizabeth S.

Founder, Citable

Frequently asked

Questions buyers ask before booking

Why not buy a citation-tracking tool like Profound or Peec?

Those are great when you need defensible numbers for an external report, with proper SOV history and citation source-of-truth across model versions. The DIY version is the right tool when you control the prompt set, want to iterate it weekly, and need a low-friction internal dashboard. The two coexist on a mature program — the SaaS for board-level reporting, the DIY for operational signal.

Is 50 prompts enough to be statistically meaningful?

For absolute citation rate, no — you would want a much larger set for tight confidence intervals. For week-over-week direction across a stable prompt set, 50 is reasonable. The signal you care about is the delta, not the absolute level. A drop on a fixed 50-prompt set is real even if the underlying level has wide error bars.

How do you handle ChatGPT not exposing citations directly via the API?

Two passes. First pass: ask the original question and capture the answer text. Second pass: send a short system prompt asking the model to list the sources it would have cited to produce that answer, output JSON. It is imperfect — the answer-mode and the source-listing-mode aren't identical — but it produces useful directional data for a weekly audit.

What's the biggest failure mode I should worry about?

Prompt-set rot. The reason this works is that the prompts reflect the questions buyers actually ask today. Lock the prompts in once and never revisit, and by next year you're tracking a question shape buyers no longer use. Refresh the prompt set quarterly using questions sales calls and chat widgets have actually surfaced.

Where do I get help if I want this built for my own brand?

Two paths. If you're comfortable with no-code workflows, the checklist on this page is enough to build v1 in a focused afternoon. If you want it built for you as part of a GEO engagement, that's a fixed-scope deliverable in our retainer. The contact link in the footer is the right entry point.

Plan a focused afternoon for v1. Then it runs itself.

The build checklist — twelve steps from blank canvas to live digest

Curate 50 prompts in a Google Sheet. One column: the prompt. Second column: the brand to track. Third column: 2–4 competitor names. Real buyer questions only — no keyword-tool exports.
Provision API keys: OpenAI, Anthropic, Perplexity (sonar-pro), SerpAPI (for the Google AI Overview parse).
Create a Notion (or Airtable / Postgres) database with these columns: Prompt, Surface, Week-ISO, Brand-mentioned (checkbox), Competitor-mentioned (multi-select), Raw-answer (long text), Citation-domains (multi-select), Delta-from-last-week (select: gained / lost / stable).
Build a Make.com (or n8n) scenario triggered on a weekly schedule.
Module 1: read the 50 prompt rows from Google Sheets.
Module 2 (Iterator): for each prompt, fan out four parallel API calls — OpenAI, Anthropic, Perplexity, SerpAPI.
Module 3 (Text Parser): regex-match each response against the brand and competitor strings. Output booleans plus matched domains (Perplexity exposes a citations field natively; for ChatGPT and Claude, use a second JSON-mode call asking the model to list the sources it would cite).
Module 4: create one database row per (prompt × surface). Set Week-ISO to the current ISO week.
Module 5: search the database for the same prompt × surface in the previous ISO week. Compute Delta-from-last-week.
Module 6 (Aggregator): roll up citation rate across all 50 prompts × 4 surfaces. Format a short digest message.
Module 7: post the digest to Slack (or email, or a dashboard page).
Error handler: on any API failure, log to a separate failures table and continue. A single surface outage should never block the other three.