Parse Google Flights shopping XHR responses#107

Open
jeffchang5 wants to merge 1 commit into
AWeirdDev:devfrom
jeffchang5:codex/google-flights-xhr-parser
@jeffchang5 jeffchang5 commented May 5, 2026

Google Flights does not always expose parseable itinerary data in the initial
script.ds:1 HTML anymore. Browser-backed integrations can still see the
shopping result payload in FlightsFrontendService/GetShoppingResults XHR
responses.

This adds support for integrations returning both the final HTML and captured
XHR bodies, then parses wrb.fr writeback payloads before falling back to the
existing script.ds:1 parser.

The existing HTML/string parser path is kept for backwards compatibility.

Main changes:

  • add a FetchResult container for html, xhr_bodies, and url
  • parse captured wrb.fr XHR payloads from Google Flights shopping results
  • keep legacy script.ds:1 parsing as fallback
  • add diagnostic metadata for missing script data, Google error responses, and
    malformed/empty payloads
  • update the local Playwright integration to capture GetShoppingResults
    responses
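
The flow described above (try captured XHR bodies first, then fall back to the HTML parser) can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the field names html, xhr_bodies, and url come from the description, while `parse_with_fallback` and the two parser callables passed into it are hypothetical stand-ins.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional, Union


@dataclass
class FetchResult:
    """Illustrative container; field names follow the PR description."""
    html: str
    xhr_bodies: list = field(default_factory=list)  # element type (str/bytes) is an assumption
    url: Optional[str] = None


def parse_with_fallback(
    result: Union[str, bytes, FetchResult],
    parse_xhr: Callable[[object], list],
    parse_html: Callable[[object], list],
) -> list:
    """Hypothetical router: XHR bodies first, legacy HTML parsing as fallback."""
    # Plain HTML (the legacy integration return type) goes straight to the HTML parser.
    if isinstance(result, (str, bytes)):
        return parse_html(result)
    # Keep scanning, so a later valid body still wins after an earlier error body.
    for body in result.xhr_bodies:
        flights = parse_xhr(body)
        if flights:
            return flights
    # No usable XHR payload: fall back to the HTML path.
    return parse_html(result.html)
```

Passing the parsers in as callables keeps the sketch self-contained; in the PR the routing presumably lives inside parse() itself.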

Tests added for:

  • existing script.ds:1 fixtures
  • missing script data
  • Google ErrorResponse XHRs
  • captured wrb.fr payloads
  • integrations returning FetchResult
  • later valid XHR payloads after an earlier error response

Tested with:

python -m pytest tests -q

Summary by CodeRabbit

Release Notes

  • New Features

    • Added browser-based flight search integration with automatic capture of flight data responses.
    • Enhanced parser to handle captured XHR responses alongside HTML content.
    • Improved diagnostic reporting for parsing failures and missing data.
  • Bug Fixes

    • Strengthened parsing resilience with better handling of malformed payloads and edge cases.

coderabbitai Bot commented May 5, 2026

Walkthrough

This PR introduces XHR payload capture and parsing to extend flight data extraction beyond HTML scraping. A new FetchResult dataclass bundles HTML and captured HTTP response bodies; a Playwright integration captures matching XHR responses; the parser is refactored to accept and parse both HTML and XHR payloads with diagnostic metadata; and related exports and type signatures are updated accordingly.

Changes

XHR Payload Capture and Parsing

| Layer | File(s) | Summary |
|---|---|---|
| Data Shape | fast_flights/fetch_result.py, fast_flights/__init__.py | New FetchResult dataclass bundles html (string), xhr_bodies (list), and optional url. Exported from package root. |
| Integration Interface | fast_flights/integrations/base.py | Integration.fetch_html return type widened from str to str \| FetchResult, enabling implementations to return either HTML or structured payload data. |
| Playwright Integration | fast_flights/integrations/playwright.py, fast_flights/integrations/__init__.py | New Playwright integration launches headless Chromium, listens for XHR responses matching FlightsFrontendService/GetShoppingResults, collects response bodies, and returns a FetchResult with final HTML, captured XHR bodies, and effective URL. Exported from integrations package. |
| Parser Refactoring | fast_flights/parser.py, fast_flights/model.py | parse() now accepts str \| bytes \| FetchResult and routes to either XHR parsing (parse_xhr_body) or HTML parsing based on input type. New XHR parser extracts JSON frames, detects error responses, and recursively traverses writeback payloads. HTML parsing split into helper functions with defensive script/data extraction. Payload parsing rewritten with safe indexing, de-duplication, and error tracking. JsMetadata adds optional diagnostics field to record parsing status, source, and error counts. |
| Fetcher Wiring | fast_flights/fetcher.py | fetch_flights_html return type updated to str \| FetchResult. get_flights now stores the result in a local variable before parsing, accommodating both return types transparently. |
| Tests and Validation | tests/test_parser_xhr.py | New comprehensive test module with fixtures for DS:1 scripts, WRB/XHR payloads, and error responses. Tests verify HTML parsing still works, missing scripts yield diagnostic metadata, XHR errors are detected, captured payloads populate flight models, integration path uses XHR payloads, and later valid XHR responses override earlier errors. |

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client / get_flights()
    participant Fetcher as Fetcher
    participant Integration as Integration<br/>(Playwright)
    participant Browser as Browser<br/>(Chromium)
    participant Parser as Parser

    Client->>Fetcher: get_flights(query, integration)
    Fetcher->>Integration: fetch_html(query)
    Integration->>Browser: launch & navigate
    Browser->>Browser: listen for XHR responses<br/>(FlightsFrontendService/...)
    Browser->>Browser: capture matching<br/>response bodies
    Integration->>Browser: extract final HTML
    Integration->>Browser: close
    Integration-->>Fetcher: FetchResult{html, xhr_bodies, url}
    Fetcher->>Parser: parse(FetchResult)
    alt XHR bodies present and valid
        Parser->>Parser: parse_xhr_body(xhr_bodies)
        Parser-->>Fetcher: MetaList with flights
    else XHR missing or invalid
        Parser->>Parser: _parse_html(html)
        Parser-->>Fetcher: MetaList with flights or empty
    end
    Fetcher-->>Client: MetaList

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A playwright script hops through the bytes,
XHR payloads caught in chrome-tinted lights,
Diagnostics bundled, each flight takes flight,
HTML and responses dance side by side,
One parser to parse them all—what a ride! 🛫✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.36% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding support for parsing Google Flights XHR shopping responses, which aligns with the PR's core objective. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@jeffchang5 jeffchang5 marked this pull request as ready for review May 5, 2026 15:48
@dosubot dosubot Bot added the size:XL (This PR changes 500-999 lines, ignoring generated files.) and enhancement (New feature or request) labels May 5, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
fast_flights/integrations/playwright.py (1)

22-71: 💤 Low value

Wrap browser usage in try/finally for deterministic cleanup.

browser.close() only runs on the success path. If page.goto(url, ...) at line 56 (or page.content() at line 70) raises, the explicit close is skipped. While the surrounding with sync_playwright() will eventually tear down the driver process on exit, a try/finally makes the intent explicit and guarantees prompt close on any error or timeout.

♻️ Proposed structure
         with sync_playwright() as p:
             browser = p.chromium.launch(headless=True)
-            context = browser.new_context(
-                ...
-            )
-            ...
-            html = page.content()
-            browser.close()
+            try:
+                context = browser.new_context(
+                    ...
+                )
+                ...
+                html = page.content()
+            finally:
+                browser.close()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fast_flights/integrations/playwright.py` around lines 22 - 71, Wrap the
Playwright browser/context/page lifecycle inside a try/finally so resources are
always cleaned up: after launching with sync_playwright() and creating
browser/context/page (symbols: sync_playwright, browser, context, page), move
all navigation and response handling into the try block and call browser.close()
(and optionally context.close() / page.close()) in the finally block to
guarantee deterministic cleanup even if page.goto or page.content raises; keep
the existing page.on("response", on_response) and response_objects handling
inside the try.
fast_flights/parser.py (1)

274-285: ⚡ Quick win

_wrb_payloads yields trailing metadata items as if they were payloads.

A wrb.fr frame's actual JSON payload is at frame[2]. The remaining elements (frame[3:]) are typically null, null, null, "generic" — not parseable payloads. Yielding them causes json.loads to fail in the caller (parser.py lines 88-94), which sets remembered_empty to a malformed_wrb_payload diagnostic and overwrites a more informative status (e.g. empty_flight_payload). It also runs _contains_error_response / _parse_payload against trailing strings like "generic".

If you have observed a Google response that actually places a second payload past index 2, please leave a comment; otherwise restrict to frame[2].

♻️ Proposed change
 def _wrb_payloads(frame: Any) -> Iterable[Any]:
     if not isinstance(frame, list):
         return
     if frame and frame[0] == "wrb.fr":
         if len(frame) > 2 and frame[2] is not None:
             yield frame[2]
-        for item in frame[3:]:
-            if item is not None:
-                yield item
         return
     for item in frame:
         yield from _wrb_payloads(item)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fast_flights/parser.py` around lines 274 - 285, _wrb_payloads is yielding
trailing metadata (frame[3:]) as payloads for "wrb.fr" frames; change it to
yield only the canonical JSON payload at frame[2] when present (i.e., keep the
existing check "if len(frame) > 2 and frame[2] is not None: yield frame[2]" and
remove the subsequent loop over frame[3:]), preserving the recursion branch for
non-"wrb.fr" frames; if you have evidence that Google sometimes places a second
payload after index 2, add a comment noting that edge-case before expanding the
yield logic.
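
To make the failure mode concrete, here is a small self-contained sketch of the restricted generator. The function body mirrors the proposed change above; the sample frame is an illustrative shape based on the description in this comment (payload at index 2, trailing nulls and "generic"), not a captured Google response.

```python
import json
from typing import Any, Iterable


def wrb_payloads(frame: Any) -> Iterable[Any]:
    """Yield only the payload at index 2 of each "wrb.fr" frame."""
    if not isinstance(frame, list):
        return
    if frame and frame[0] == "wrb.fr":
        if len(frame) > 2 and frame[2] is not None:
            yield frame[2]
        return  # trailing metadata (frame[3:]) is deliberately ignored
    for item in frame:
        # Non-"wrb.fr" lists are containers: recurse into their items.
        yield from wrb_payloads(item)


# Illustrative frame: the payload is a JSON string, followed by metadata.
frames = [["wrb.fr", None, json.dumps([1, 2]), None, None, None, "generic"]]
payloads = list(wrb_payloads(frames))
decoded = [json.loads(p) for p in payloads]
```

With the `frame[3:]` loop still in place, "generic" would also be yielded, and `json.loads("generic")` raises `json.JSONDecodeError`, which is what triggers the misleading malformed-payload diagnostic.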
tests/test_parser_xhr.py (1)

51-65: ⚡ Quick win

Hardcoded byte-count prefix 123 (and 269 in _error_body) should be computed instead.

The Google writeback framing format is ")]}'\n\n<byteLen>\n<json_frame>", where <byteLen> is the UTF-8 byte length of <json_frame>. Currently, _wrb_body and _wrb_frame_body hardcode 123, and _error_body hardcodes 269, regardless of actual frame content.

Although _google_json_frames treats the byte count as a marker only (not strict framing validation), the fixtures should still mirror real responses. Compute the actual byte length instead:

♻️ Proposed fix
 def _wrb_body(payload):
     frame = [["wrb.fr", None, json.dumps(payload)]]
-    return ")]}'\n\n123\n" + json.dumps(frame)
+    encoded = json.dumps(frame)
+    return f")]}'\n\n{len(encoded.encode())}\n" + encoded


 def _wrb_frame_body(frame):
-    return ")]}'\n\n123\n" + json.dumps(frame)
+    encoded = json.dumps(frame)
+    return f")]}'\n\n{len(encoded.encode())}\n" + encoded
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_parser_xhr.py` around lines 51 - 65, The fixtures currently
hardcode the byte-count prefixes (e.g., "123" and "269") instead of computing
the UTF-8 byte length of the JSON frame; update _wrb_body and _wrb_frame_body to
compute byte_len = len(json.dumps(frame).encode("utf-8")) and interpolate that
value into the prefix instead of "123", and update _error_body to build the
error JSON string, compute its UTF-8 byte length similarly and use that value
instead of "269" so the prefix equals the actual byte length of the JSON frame.
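
A minimal sketch of a fixture helper that computes the prefix, assuming the framing described in this comment. The name `wrb_body` and the `XSSI_PREFIX` constant are illustrative, not the identifiers used in the test module.

```python
import json

XSSI_PREFIX = ")]}'"  # guard prefix described in this comment


def wrb_body(payload) -> str:
    """Build a writeback-style body whose length line matches the frame."""
    frame = [["wrb.fr", None, json.dumps(payload)]]
    encoded = json.dumps(frame)
    # Length of the JSON frame in UTF-8 bytes, not in characters.
    byte_len = len(encoded.encode("utf-8"))
    return f"{XSSI_PREFIX}\n\n{byte_len}\n{encoded}"


body = wrb_body({"city": "Zürich"})
# json.dumps never emits raw newlines, so three splits isolate the frame.
prefix, blank, length_line, frame_text = body.split("\n", 3)
```

Note that `json.dumps` escapes non-ASCII characters by default (`ensure_ascii=True`), so the character count and the UTF-8 byte count coincide here; they diverge only if the fixture is serialized with `ensure_ascii=False`.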
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 17c80431-d6f8-402b-8ed9-eb97944331dd

📥 Commits

Reviewing files that changed from the base of the PR and between 0138641 and cb0a6a5.

📒 Files selected for processing (9)
  • fast_flights/__init__.py
  • fast_flights/fetch_result.py
  • fast_flights/fetcher.py
  • fast_flights/integrations/__init__.py
  • fast_flights/integrations/base.py
  • fast_flights/integrations/playwright.py
  • fast_flights/model.py
  • fast_flights/parser.py
  • tests/test_parser_xhr.py

    def fetch_html(self, q: Query | str, /) -> FetchResult:
        from playwright.sync_api import sync_playwright

        url = q.url() if isinstance(q, Query) else GOOGLE_FLIGHTS_URL + "?q=" + q

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

URL-encode the raw string query.

When q is a str, the value is concatenated directly into the URL. Any spaces, &, #, +, %, or non-ASCII characters will produce a malformed URL or be parsed by Google as additional query parameters. A natural-language query like "Flights from TPE to MYJ on 2025-12-22 one way economy class" (the same example used in the get_flights docstring) will be broken on the spaces.

🛡️ Proposed fix
+from urllib.parse import quote_plus
+
 ...
-        url = q.url() if isinstance(q, Query) else GOOGLE_FLIGHTS_URL + "?q=" + q
+        url = q.url() if isinstance(q, Query) else f"{GOOGLE_FLIGHTS_URL}?q={quote_plus(q)}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-url = q.url() if isinstance(q, Query) else GOOGLE_FLIGHTS_URL + "?q=" + q
+from urllib.parse import quote_plus
+...
+url = q.url() if isinstance(q, Query) else f"{GOOGLE_FLIGHTS_URL}?q={quote_plus(q)}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fast_flights/integrations/playwright.py` at line 19, The code concatenates
raw string queries into GOOGLE_FLIGHTS_URL (line creating url = q.url() if
isinstance(q, Query) else GOOGLE_FLIGHTS_URL + "?q=" + q) which breaks on spaces
and special chars; fix by URL-encoding the string branch: when q is a str,
encode it with urllib.parse.quote_plus or urllib.parse.quote (to handle spaces
and non-ASCII) and build the URL as GOOGLE_FLIGHTS_URL + "?q=" + encoded_q (or
use urllib.parse.urlencode with {'q': q}) so the Query branch (q.url()) remains
unchanged and all special characters are properly escaped.
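
The effect of the encoding can be checked with the standard library alone. In this sketch `BASE_URL` is a placeholder, not the real value of GOOGLE_FLIGHTS_URL, and `build_url` is a hypothetical helper covering only the str branch.

```python
from urllib.parse import parse_qs, quote_plus, urlsplit

BASE_URL = "https://example.com/flights"  # placeholder, not GOOGLE_FLIGHTS_URL


def build_url(q: str) -> str:
    """Encode a free-text query so it survives as a single q parameter."""
    return f"{BASE_URL}?q={quote_plus(q)}"


query = "Flights from TPE to Zürich & back #1"
url = build_url(query)

# Round trip: the server side should see exactly one q value, unchanged.
recovered = parse_qs(urlsplit(url).query)
```

Without `quote_plus`, the `&` would start a second parameter and everything after `#` would be treated as a fragment and never sent to the server.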
