Reverse engineering Google AI Mode's sentence-level citation behaviour using #:~:text= URL fragments.
Every Google AI Mode and Gemini citation URL contains a hidden Text Fragments anchor:

```
https://example.com/page#:~:text=Exact%20sentence%20Google%20cited%20here
```

Decode it and you know exactly which sentence Google extracted from the source page — no guesswork. To our knowledge, this is the first reproducible study to exploit this at scale.
```mermaid
flowchart LR
    A[queries.csv 100 queries] --> B[Bright Data AI Mode Scraper]
    B --> C["Raw citation JSON with #:~:text= URLs"]
    C --> D[Text fragment parser]
    D --> E[Cited sentence per citation row]
    E --> F[Source page scraper]
    F --> G[Positional analysis]
    G --> H[Statistical analysis + Charts]
```
- Collect — Use Bright Data's Google AI Mode Scraper (real SERP, not API) to gather citation URLs for 100 queries
- Parse — Decode #:~:text= fragments to extract the exact cited sentence from every citation URL
- Scrape — Fetch source pages and locate each cited sentence within the document (positional analysis)
- Analyse — Statistical tests for positional bias, sentence length preferences, structured content advantage, platform divergence
- Visualise — Publication-quality charts for the accompanying article
- Python 3.11+
- A Bright Data account with API key
- ~$6–25 budget for data collection (100 queries + their source pages ≈ $25 max)
```shell
git clone https://github.com/yourusername/grounding-citation-analysis
cd grounding-citation-analysis

# Install dependencies
pip install -r requirements.txt

# Configure credentials
cp .env.example .env
# Edit .env and set: BRIGHTDATA_API_KEY=your_key_here
```

```shell
# Step 1: Collect AI Mode citations
python scripts/01_collect_ai_mode.py --limit 20    # start small to verify

# Step 2: (Optional) Collect Gemini citations
# Set BRIGHTDATA_GEMINI_DATASET_ID in .env first
python scripts/02_collect_gemini.py --limit 20

# Step 3: Parse #:~:text= fragments → citations.csv
python scripts/03_parse_text_fragments.py

# Step 4: Scrape source pages (positional analysis)
python scripts/04_scrape_source_pages.py --limit 100    # start small

# Step 5: Statistical analysis
python scripts/05_analyze_patterns.py

# Step 6: Generate charts
python scripts/06_generate_charts.py
```

```
grounding-citation-analysis/
├── .env                            # Your Bright Data API key (not committed)
├── .gitignore
├── requirements.txt
├── README.md
│
├── queries/
│   └── queries.csv                 # 100 queries across 12 categories
│
├── scripts/
│   ├── 01_collect_ai_mode.py       # Bright Data AI Mode Scraper trigger + poll
│   ├── 02_collect_gemini.py        # Bright Data Gemini Scraper trigger + poll
│   ├── 03_parse_text_fragments.py  # #:~:text= URL decoder → citations.csv
│   ├── 04_scrape_source_pages.py   # Source page fetcher + positional analysis
│   ├── 05_analyze_patterns.py      # Statistical analysis (scipy)
│   └── 06_generate_charts.py       # Matplotlib/seaborn visualisations
│
├── data/
│   ├── raw/                        # Raw Bright Data JSON snapshots
│   ├── parsed/
│   │   ├── citations.csv           # One row per citation, includes cited_sentence
│   │   ├── answers.csv             # One row per query/answer
│   │   └── source_pages.csv        # Source page positional data
│   └── analysis/
│       ├── summary_stats.json
│       ├── positional_distribution.csv
│       ├── domain_frequency.csv
│       ├── platform_overlap.csv
│       └── category_breakdown.csv
│
├── notebooks/
│   ├── 01_methodology.ipynb
│   └── 02_findings.ipynb
│
├── reports/                        # Generated charts (PNG)
│
└── article/
    ├── article.md                  # Full LinkedIn article
    └── diagrams/                   # Mermaid source files
```
| Column | Description |
|---|---|
| `platform` | `ai_mode` or `gemini` |
| `query` | The search query |
| `citation_url_raw` | Full URL including `#:~:text=` fragment |
| `citation_url_clean` | URL without fragment |
| `domain` | Cited domain |
| `has_text_fragment` | Boolean — does URL contain `#:~:text=`? |
| `cited_sentence` | Decoded sentence Google cited |
| `cited_sentence_word_count` | Word count of cited sentence |
| `fragment_raw` | Raw (URL-encoded) fragment |
| `cited_flag` | Boolean from Bright Data — marked as citation |
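Most of these columns can be derived from the raw URL with the standard library alone. A hypothetical sketch of one parser row (column names from the table above; the function name and example query are mine, and `cited_flag` is omitted because it comes from Bright Data's output, not the URL):

```python
from urllib.parse import unquote, urldefrag, urlsplit

def citation_row(platform: str, query: str, url: str) -> dict:
    """Build one citations.csv row from a raw citation URL."""
    clean, fragment = urldefrag(url)             # split off everything after '#'
    has_fragment = fragment.startswith(":~:text=")
    raw = fragment[len(":~:text="):] if has_fragment else ""
    sentence = unquote(raw)
    return {
        "platform": platform,
        "query": query,
        "citation_url_raw": url,
        "citation_url_clean": clean,
        "domain": urlsplit(url).netloc,
        "has_text_fragment": has_fragment,
        "cited_sentence": sentence,
        "cited_sentence_word_count": len(sentence.split()),
        "fragment_raw": raw,
    }

row = citation_row("ai_mode", "best crm software",
                   "https://example.com/guide#:~:text=HubSpot%20tops%20the%20list")
```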
| Column | Description |
|---|---|
| `found` | Boolean — cited sentence located in page |
| `block_index` | Element index where sentence was found |
| `block_total` | Total elements in page |
| `relative_position` | `block_index / block_total` (0 = top, 1 = bottom) |
| `page_word_count` | Total word count of source page |
| `has_structured_content` | Page contains `<ul>`, `<ol>`, or `<table>` |
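One way `relative_position` might be computed once a page is fetched: treat the document as an ordered list of text blocks and record where the cited sentence first appears. This is a sketch under that assumption; the repo's `04_scrape_source_pages.py` likely operates on parsed HTML elements rather than plain strings:

```python
def locate_sentence(blocks: list[str], cited_sentence: str) -> dict:
    """Find the cited sentence among a page's text blocks, record its position."""
    needle = cited_sentence.lower()
    for i, block in enumerate(blocks):
        if needle in block.lower():
            return {
                "found": True,
                "block_index": i,
                "block_total": len(blocks),
                # 0.0 = very top of the page, 1.0 = very bottom
                "relative_position": i / len(blocks),
            }
    return {"found": False, "block_index": None,
            "block_total": len(blocks), "relative_position": None}

blocks = ["Intro paragraph.", "HubSpot tops the list for small teams.", "Footer."]
pos = locate_sentence(blocks, "HubSpot tops the list")
# pos["relative_position"] is 1/3: the sentence sits a third of the way down
```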
| # | Hypothesis | Test |
|---|---|---|
| H1 | Cited sentences cluster in the top 30% of documents | One-sample t-test, mean position < 0.5 |
| H2 | Cited sentences are shorter than average page text | Descriptive statistics, histogram |
| H3 | Structured pages (lists/tables) are cited more | Chi-square test |
| H4 | AI Mode and Gemini cite overlapping but distinct URLs | Jaccard similarity |
| H5 | Sentence length varies by query category | One-way ANOVA |
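H4's overlap metric is simple to state: the Jaccard similarity of the two platforms' cited-URL sets. A minimal sketch with made-up URLs:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0          # two empty sets are conventionally identical
    return len(a & b) / len(a | b)

ai_mode = {"https://a.com/x", "https://b.com/y", "https://c.com/z"}
gemini  = {"https://b.com/y", "https://c.com/z", "https://d.com/w"}
print(jaccard(ai_mode, gemini))   # 2 shared of 4 total → 0.5
```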
| Study | Sample | Granularity | Method | This Study's Advance |
|---|---|---|---|---|
| Ahrefs (2024) | 1.9M citations | Page level | Custom crawler | Sentence-level decoding |
| Surfer SEO (2024) | 46M citations | Domain level | API collection | Real SERP; positional data |
| Seer Interactive (2024) | Variable | Query level | Gemini API | Within-page position |
| DEJAN AI (2024) | Conceptual | Theoretical | Manual + LLM | Empirical validation |
| This study | 100+ queries | Sentence level | Bright Data + `#:~:text=` | — |
AI Mode Scraper

- Dataset ID: `gd_mcswdt6z2elth3zqr2`
- Endpoint: `POST https://api.brightdata.com/datasets/v3/trigger?dataset_id=gd_mcswdt6z2elth3zqr2`
- Input: `[{"url": "https://www.google.com/search?udm=50", "prompt": "...", "country": "US"}]`
- Output includes `citations[].url` with `#:~:text=` fragments
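The trigger call can be built with the standard library alone. The dataset ID, endpoint, and payload shape come from the list above; the Bearer-token header and the prompt text are assumptions, and polling for the finished snapshot is left out:

```python
import json
import urllib.request

DATASET_ID = "gd_mcswdt6z2elth3zqr2"
TRIGGER_URL = (
    f"https://api.brightdata.com/datasets/v3/trigger?dataset_id={DATASET_ID}"
)

def build_trigger_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build the POST request that triggers one AI Mode scrape."""
    payload = [{
        "url": "https://www.google.com/search?udm=50",  # real AI Mode SERP
        "prompt": prompt,
        "country": "US",
    }]
    return urllib.request.Request(
        TRIGGER_URL,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_trigger_request("YOUR_KEY", "best crm software for startups")
# urllib.request.urlopen(req) would submit the job; the JSON response
# contains a snapshot ID that the collection scripts then poll for results
```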
Why not the Gemini API?
Surfer SEO's research found that the Gemini developer API returns different answers from the actual Google Search AI Mode SERP. Bright Data's scraper hits the real google.com/search?udm=50 endpoint via residential proxies.
MIT — see LICENSE. Cite this repo if you use the methodology.
[Your Name] | [Your LinkedIn] | [Your Website]
Accompanying article: How Google Actually Chooses Which Sentences to Cite in AI Mode