Parse Google Flights shopping XHR responses by jeffchang5 · Pull Request #107 · AWeirdDev/flights

jeffchang5 · 2026-05-05T15:46:47Z

Google Flights does not always expose parseable itinerary data in the initial
script.ds:1 HTML anymore. Browser-backed integrations can still see the
shopping result payload in FlightsFrontendService/GetShoppingResults XHR
responses.

This adds support for integrations returning both the final HTML and captured
XHR bodies, then parses wrb.fr writeback payloads before falling back to the
existing script.ds:1 parser.

The existing HTML/string parser path is kept for backwards compatibility.

Main changes:

- add a FetchResult container for html, xhr_bodies, and url
- parse captured wrb.fr XHR payloads from Google Flights shopping results
- keep legacy script.ds:1 parsing as fallback
- add diagnostic metadata for missing script data, Google error responses, and malformed/empty payloads
- update the local Playwright integration to capture GetShoppingResults responses
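A minimal sketch of what the FetchResult container might look like, assuming a plain dataclass. The field names html, xhr_bodies, and url come from this PR; the element type of xhr_bodies, the defaults, and the example values are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FetchResult:
    """Sketch of the container; types and defaults are assumed, not taken from the PR."""

    html: str
    # Assumed: each captured XHR body is kept as text.
    xhr_bodies: list[str] = field(default_factory=list)
    # Optional effective URL of the page that was fetched.
    url: str | None = None


# Example values are made up; the body only imitates the ")]}'" framing.
result = FetchResult(html="<html></html>", xhr_bodies=[")]}'\n\n2\n[]"])
```

An integration that only has HTML can still return a bare string, since the PR keeps that path for backwards compatibility.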

Tests added for:

- existing script.ds:1 fixtures
- missing script data
- Google ErrorResponse XHRs
- captured wrb.fr payloads
- integrations returning FetchResult
- later valid XHR payloads after an earlier error response

Tested with:

python -m pytest tests -q

Summary by CodeRabbit

Release Notes

New Features
- Added browser-based flight search integration with automatic capture of flight data responses.
- Enhanced parser to handle captured XHR responses alongside HTML content.
- Improved diagnostic reporting for parsing failures and missing data.
Bug Fixes
- Strengthened parsing resilience with better handling of malformed payloads and edge cases.

coderabbitai · 2026-05-05T15:46:58Z

Walkthrough

This PR introduces XHR payload capture and parsing to extend flight data extraction beyond HTML scraping. A new FetchResult dataclass bundles HTML and captured HTTP response bodies; a Playwright integration captures matching XHR responses; the parser is refactored to accept and parse both HTML and XHR payloads with diagnostic metadata; and related exports and type signatures are updated accordingly.

Changes

XHR Payload Capture and Parsing

| Layer / File(s) | Summary |
| --- | --- |
| Data Shape: `fast_flights/fetch_result.py`, `fast_flights/__init__.py` | New `FetchResult` dataclass bundles `html` (string), `xhr_bodies` (list), and optional `url`. Exported from package root. |
| Integration Interface: `fast_flights/integrations/base.py` | `Integration.fetch_html` return type widened from `str` to `str \| FetchResult`, enabling implementations to return either HTML or structured payload data. |
| Playwright Integration: `fast_flights/integrations/playwright.py`, `fast_flights/integrations/__init__.py` | New `Playwright` integration launches headless Chromium, listens for XHR responses matching `FlightsFrontendService/GetShoppingResults`, collects response bodies, and returns a `FetchResult` with final HTML, captured XHR bodies, and effective URL. Exported from integrations package. |
| Parser Refactoring: `fast_flights/parser.py`, `fast_flights/model.py` | `parse()` now accepts `str \| bytes \| FetchResult` and routes to either XHR parsing (`parse_xhr_body`) or HTML parsing based on input type. New XHR parser extracts JSON frames, detects error responses, and recursively traverses writeback payloads. HTML parsing split into helper functions with defensive script/data extraction. Payload parsing rewritten with safe indexing, de-duplication, and error tracking. `JsMetadata` adds optional `diagnostics` field to record parsing status, source, and error counts. |
| Fetcher Wiring: `fast_flights/fetcher.py` | `fetch_flights_html` return type updated to `str \| FetchResult`. `get_flights` now stores the result in a local variable before parsing, accommodating both return types transparently. |
| Tests and Validation: `tests/test_parser_xhr.py` | New comprehensive test module with fixtures for DS:1 scripts, WRB/XHR payloads, and error responses. Tests verify HTML parsing still works, missing scripts yield diagnostic metadata, XHR errors are detected, captured payloads populate flight models, integration path uses XHR payloads, and later valid XHR responses override earlier errors. |
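The capture step in the Playwright row can be illustrated with a small response handler. This is a sketch under assumptions, not the PR's code: `make_collector` and `FakeResponse` are hypothetical names, and the handler relies only on a response object exposing a `url` attribute and a `text()` method, which Playwright's sync `Response` provides.

```python
from typing import Any, Callable, List

# URL fragment named in the PR description.
SHOPPING_MARKER = "FlightsFrontendService/GetShoppingResults"


def make_collector(bodies: List[str]) -> Callable[[Any], None]:
    """Return a handler that could be registered with page.on("response", ...)."""

    def on_response(response: Any) -> None:
        if SHOPPING_MARKER not in response.url:
            return
        try:
            bodies.append(response.text())
        except Exception:
            # Assumed policy: an unreadable body is skipped rather than fatal.
            pass

    return on_response


class FakeResponse:
    """Stand-in for a browser response, for illustration only."""

    def __init__(self, url: str, body: str) -> None:
        self.url = url
        self._body = body

    def text(self) -> str:
        return self._body


captured: List[str] = []
handler = make_collector(captured)
handler(FakeResponse("https://example.test/other", "ignored"))
handler(FakeResponse("https://example.test/_/FlightsFrontendService/GetShoppingResults", "payload"))
```

Filtering by URL substring keeps unrelated XHR traffic out of the list that the parser later walks.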

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client / get_flights()
    participant Fetcher as Fetcher
    participant Integration as Integration<br/>(Playwright)
    participant Browser as Browser<br/>(Chromium)
    participant Parser as Parser

    Client->>Fetcher: get_flights(query, integration)
    Fetcher->>Integration: fetch_html(query)
    Integration->>Browser: launch & navigate
    Browser->>Browser: listen for XHR responses<br/>(FlightsFrontendService/...)
    Browser->>Browser: capture matching<br/>response bodies
    Integration->>Browser: extract final HTML
    Integration->>Browser: close
    Integration-->>Fetcher: FetchResult{html, xhr_bodies, url}
    Fetcher->>Parser: parse(FetchResult)
    alt XHR bodies present and valid
        Parser->>Parser: parse_xhr_body(xhr_bodies)
        Parser-->>Fetcher: MetaList with flights
    else XHR missing or invalid
        Parser->>Parser: _parse_html(html)
        Parser-->>Fetcher: MetaList with flights or empty
    end
    Fetcher-->>Client: MetaList
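The alt branch of the diagram can be sketched as a small dispatcher. Everything here is a simplified assumption: `parse_xhr` and `parse_html` are hypothetical stand-ins for the real parsers, flights are plain strings rather than models, and the real `parse()` also records diagnostic metadata.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FetchResult:
    # Assumed shape, following the fields named in this PR.
    html: str
    xhr_bodies: list[str] = field(default_factory=list)
    url: str | None = None


def parse_xhr(body: str) -> list[str]:
    """Hypothetical stand-in: flights found in one captured XHR body."""
    return [line for line in body.splitlines() if line.startswith("flight:")]


def parse_html(html: str) -> list[str]:
    """Hypothetical stand-in for the legacy script.ds:1 parser."""
    return ["flight:from-html"] if "ds:1" in html else []


def parse(data: str | bytes | FetchResult) -> list[str]:
    if isinstance(data, bytes):
        data = data.decode("utf-8", errors="replace")
    if isinstance(data, str):
        return parse_html(data)
    # Walk captured bodies in order; an earlier error body yields nothing,
    # so a later valid body can still supply the flights.
    for body in data.xhr_bodies:
        flights = parse_xhr(body)
        if flights:
            return flights
    # No usable XHR payload: fall back to the HTML path.
    return parse_html(data.html)
```

Plain strings and bytes never touch the XHR path, which is how existing callers keep their behavior.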

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A playwright script hops through the bytes,
XHR payloads caught in chrome-tinted lights,
Diagnostics bundled, each flight takes flight,
HTML and responses dance side by side,
One parser to parse them all—what a ride! 🛫✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.36% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding support for parsing Google Flights XHR shopping responses, which aligns with the PR's core objective. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

fast_flights/integrations/playwright.py (1)
22-71: 💤 Low value

Wrap browser usage in try/finally for deterministic cleanup.

browser.close() only runs on the success path. If page.goto(url, ...) at line 56 (or page.content() at line 70) raises, the explicit close is skipped. While the surrounding with sync_playwright() will eventually tear down the driver process on exit, a try/finally makes the intent explicit and guarantees prompt close on any error or timeout.
♻️ Proposed structure
         with sync_playwright() as p:
             browser = p.chromium.launch(headless=True)
-            context = browser.new_context(
-                ...
-            )
-            ...
-            html = page.content()
-            browser.close()
+            try:
+                context = browser.new_context(
+                    ...
+                )
+                ...
+                html = page.content()
+            finally:
+                browser.close()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fast_flights/integrations/playwright.py` around lines 22 - 71, Wrap the
Playwright browser/context/page lifecycle inside a try/finally so resources are
always cleaned up: after launching with sync_playwright() and creating
browser/context/page (symbols: sync_playwright, browser, context, page), move
all navigation and response handling into the try block and call browser.close()
(and optionally context.close() / page.close()) in the finally block to
guarantee deterministic cleanup even if page.goto or page.content raises; keep
the existing page.on("response", on_response) and response_objects handling
inside the try.
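The effect of the try/finally structure can be shown without a real browser. In this sketch `FakeBrowser` and `fetch` are hypothetical stand-ins, and the raised `TimeoutError` only imitates a navigation failure.

```python
class FakeBrowser:
    """Stand-in for a browser handle, for illustration only."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


def fetch(browser: FakeBrowser, fail: bool) -> str:
    try:
        if fail:
            # Imitates page.goto() or page.content() raising.
            raise TimeoutError("navigation timed out")
        return "<html></html>"
    finally:
        # Runs on both the success path and the error path.
        browser.close()


ok_browser = FakeBrowser()
html = fetch(ok_browser, fail=False)

failed_browser = FakeBrowser()
try:
    fetch(failed_browser, fail=True)
except TimeoutError:
    pass
```

Both browsers end up closed, and the exception still propagates to the caller.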
fast_flights/parser.py (1)
274-285: ⚡ Quick win

_wrb_payloads yields trailing metadata items as if they were payloads.

A wrb.fr frame's actual JSON payload is at frame[2]. The remaining elements (frame[3:]) are typically null, null, null, "generic" — not parseable payloads. Yielding them causes json.loads to fail in the caller (parser.py lines 88-94), which sets remembered_empty to a malformed_wrb_payload diagnostic and overwrites a more informative status (e.g. empty_flight_payload). It also runs _contains_error_response / _parse_payload against trailing strings like "generic".

If you have observed a Google response that actually places a second payload past index 2, please leave a comment; otherwise restrict to frame[2].
♻️ Proposed change
 def _wrb_payloads(frame: Any) -> Iterable[Any]:
     if not isinstance(frame, list):
         return
     if frame and frame[0] == "wrb.fr":
         if len(frame) > 2 and frame[2] is not None:
             yield frame[2]
-        for item in frame[3:]:
-            if item is not None:
-                yield item
         return
     for item in frame:
         yield from _wrb_payloads(item)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fast_flights/parser.py` around lines 274 - 285, _wrb_payloads is yielding
trailing metadata (frame[3:]) as payloads for "wrb.fr" frames; change it to
yield only the canonical JSON payload at frame[2] when present (i.e., keep the
existing check "if len(frame) > 2 and frame[2] is not None: yield frame[2]" and
remove the subsequent loop over frame[3:]), preserving the recursion branch for
non-"wrb.fr" frames; if you have evidence that Google sometimes places a second
payload after index 2, add a comment noting that edge-case before expanding the
yield logic.
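A standalone sketch of the restricted traversal, run against a made-up frame shaped like the description above (payload at index 2, then null padding and a trailing "generic" marker). The function mirrors the proposed change; the sample data is an assumption.

```python
import json
from typing import Any, Iterable


def wrb_payloads(frame: Any) -> Iterable[Any]:
    """Yield only the payload at index 2 of each "wrb.fr" frame."""
    if not isinstance(frame, list):
        return
    if frame and frame[0] == "wrb.fr":
        if len(frame) > 2 and frame[2] is not None:
            yield frame[2]
        return
    # Not a wrb.fr frame itself: look inside nested lists.
    for item in frame:
        yield from wrb_payloads(item)


# Hypothetical sample; real responses carry much larger payloads.
sample = [["wrb.fr", None, json.dumps([[1, 2], "flights"]), None, None, None, "generic"]]
payloads = list(wrb_payloads(sample))
decoded = [json.loads(p) for p in payloads]
```

With the loop over `frame[3:]` removed, the string "generic" never reaches `json.loads`, so it cannot produce a spurious malformed-payload diagnostic.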
tests/test_parser_xhr.py (1)
51-65: ⚡ Quick win

Hardcoded byte-count prefix 123 (and 269 in _error_body) should be computed instead.

The Google writeback framing format is `)]}'\n\n<byteLen>\n<json_frame>`, where `<byteLen>` is the UTF-8 byte length of `<json_frame>`. Currently, _wrb_body and _wrb_frame_body hardcode 123, and _error_body hardcodes 269, regardless of actual frame content.

Although _google_json_frames treats the byte count as a marker only (not strict framing validation), the fixtures should still mirror real responses. Compute the actual byte length instead:
♻️ Proposed fix
 def _wrb_body(payload):
     frame = [["wrb.fr", None, json.dumps(payload)]]
-    return ")]}'\n\n123\n" + json.dumps(frame)
+    encoded = json.dumps(frame)
+    return f")]}'\n\n{len(encoded.encode())}\n" + encoded


 def _wrb_frame_body(frame):
-    return ")]}'\n\n123\n" + json.dumps(frame)
+    encoded = json.dumps(frame)
+    return f")]}'\n\n{len(encoded.encode())}\n" + encoded
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_parser_xhr.py` around lines 51 - 65, The fixtures currently
hardcode the byte-count prefixes (e.g., "123" and "269") instead of computing
the UTF-8 byte length of the JSON frame; update _wrb_body and _wrb_frame_body to
compute byte_len = len(json.dumps(frame).encode("utf-8")) and interpolate that
value into the prefix instead of "123", and update _error_body to build the
error JSON string, compute its UTF-8 byte length similarly and use that value
instead of "269" so the prefix equals the actual byte length of the JSON frame.
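A round-trip sketch of the framing as this comment describes it: prefix, byte length, newline, JSON frame. `frame_body` and `read_frame` are hypothetical helpers, and the strict length check is an illustration only, since the comment notes the parser treats the count as a marker.

```python
import json

XSSI_PREFIX = ")]}'\n\n"


def frame_body(frame: object) -> str:
    """Build one framed body with a computed UTF-8 byte length."""
    encoded = json.dumps(frame)
    return f"{XSSI_PREFIX}{len(encoded.encode('utf-8'))}\n{encoded}"


def read_frame(body: str) -> object:
    """Split one framed body and verify the declared length (illustrative)."""
    rest = body[len(XSSI_PREFIX):]
    declared, _, encoded = rest.partition("\n")
    if int(declared) != len(encoded.encode("utf-8")):
        raise ValueError("declared byte length does not match the frame")
    return json.loads(encoded)


# Made-up payload with non-ASCII text.
frame = [["wrb.fr", None, json.dumps({"city": "Taipei 台北"})]]
body = frame_body(frame)
roundtrip = read_frame(body)

try:
    read_frame(")]}'\n\n123\n[]")  # hardcoded prefix that does not match
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

With the `json.dumps` defaults the output is pure ASCII, so character and byte counts coincide; encoding explicitly keeps the helper correct if `ensure_ascii=False` is ever used.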


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 17c80431-d6f8-402b-8ed9-eb97944331dd

📥 Commits

Reviewing files that changed from the base of the PR and between 0138641 and cb0a6a5.

📒 Files selected for processing (9)

fast_flights/__init__.py
fast_flights/fetch_result.py
fast_flights/fetcher.py
fast_flights/integrations/__init__.py
fast_flights/integrations/base.py
fast_flights/integrations/playwright.py
fast_flights/model.py
fast_flights/parser.py
tests/test_parser_xhr.py

coderabbitai · 2026-05-05T15:56:38Z

+    def fetch_html(self, q: Query | str, /) -> FetchResult:
+        from playwright.sync_api import sync_playwright
+
+        url = q.url() if isinstance(q, Query) else GOOGLE_FLIGHTS_URL + "?q=" + q


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

URL-encode the raw string query.

When q is a str, the value is concatenated directly into the URL. Any spaces, &, #, +, %, or non-ASCII characters will produce a malformed URL or be parsed by Google as additional query parameters. A natural-language query like "Flights from TPE to MYJ on 2025-12-22 one way economy class" (the same example used in the get_flights docstring) will break on the spaces.

🛡️ Proposed fix

+from urllib.parse import quote_plus
+...
-        url = q.url() if isinstance(q, Query) else GOOGLE_FLIGHTS_URL + "?q=" + q
+        url = q.url() if isinstance(q, Query) else f"{GOOGLE_FLIGHTS_URL}?q={quote_plus(q)}"

📝 Committable suggestion

Suggested change

url = q.url() if isinstance(q, Query) else GOOGLE_FLIGHTS_URL + "?q=" + q

from urllib.parse import quote_plus

...

url = q.url() if isinstance(q, Query) else f"{GOOGLE_FLIGHTS_URL}?q={quote_plus(q)}"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fast_flights/integrations/playwright.py` at line 19, The code concatenates raw string queries into GOOGLE_FLIGHTS_URL (line creating url = q.url() if isinstance(q, Query) else GOOGLE_FLIGHTS_URL + "?q=" + q) which breaks on spaces and special chars; fix by URL-encoding the string branch: when q is a str, encode it with urllib.parse.quote_plus or urllib.parse.quote (to handle spaces and non-ASCII) and build the URL as GOOGLE_FLIGHTS_URL + "?q=" + encoded_q (or use urllib.parse.urlencode with {'q': q}) so the Query branch (q.url()) remains unchanged and all special characters are properly escaped.
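A short sketch of what the encoding fix produces for the docstring example. The value of `GOOGLE_FLIGHTS_URL` below is an assumption for illustration (the library defines its own constant), and `build_url` is a hypothetical helper covering only the string branch.

```python
from urllib.parse import parse_qs, quote_plus, urlsplit

# Assumed value; the package's own constant may differ.
GOOGLE_FLIGHTS_URL = "https://www.google.com/travel/flights"


def build_url(q: str) -> str:
    """Build the search URL for a raw string query, escaping the value."""
    return f"{GOOGLE_FLIGHTS_URL}?q={quote_plus(q)}"


url = build_url("Flights from TPE to MYJ on 2025-12-22 one way economy class")
tricky = build_url("a&b=c #d")

# Decoding the query string recovers the original text as a single parameter.
recovered = parse_qs(urlsplit(tricky).query)["q"][0]
```

Spaces become `+`, hyphens pass through unchanged, and characters such as `&`, `=`, and `#` are percent-encoded so they stay inside the `q` value.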

Parse Google Flights shopping XHR responses

cb0a6a5

jeffchang5 marked this pull request as ready for review May 5, 2026 15:48

dosubot Bot added the size:XL (This PR changes 500-999 lines, ignoring generated files.) and enhancement (New feature or request) labels May 5, 2026

coderabbitai Bot reviewed May 5, 2026

jeffchang5 wants to merge 1 commit into AWeirdDev:dev from jeffchang5:codex/google-flights-xhr-parser
