Parse Google Flights shopping XHR responses#107
Conversation
📝 WalkthroughWalkthroughThis PR introduces XHR payload capture and parsing to extend flight data extraction beyond HTML scraping. A new ChangesXHR Payload Capture and Parsing
Sequence Diagram(s)sequenceDiagram
participant Client as Client / get_flights()
participant Fetcher as Fetcher
participant Integration as Integration<br/>(Playwright)
participant Browser as Browser<br/>(Chromium)
participant Parser as Parser
Client->>Fetcher: get_flights(query, integration)
Fetcher->>Integration: fetch_html(query)
Integration->>Browser: launch & navigate
Browser->>Browser: listen for XHR responses<br/>(FlightsFrontendService/...)
Browser->>Browser: capture matching<br/>response bodies
Integration->>Browser: extract final HTML
Integration->>Browser: close
Integration-->>Fetcher: FetchResult{html, xhr_bodies, url}
Fetcher->>Parser: parse(FetchResult)
alt XHR bodies present and valid
Parser->>Parser: parse_xhr_body(xhr_bodies)
Parser-->>Fetcher: MetaList with flights
else XHR missing or invalid
Parser->>Parser: _parse_html(html)
Parser-->>Fetcher: MetaList with flights or empty
end
Fetcher-->>Client: MetaList
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
fast_flights/integrations/playwright.py (1)
22-71: 💤 Low valueWrap browser usage in
try/finallyfor deterministic cleanup.
browser.close()only runs on the success path. Ifpage.goto(url, ...)at line 56 (orpage.content()at line 70) raises, the explicit close is skipped. While the surroundingwith sync_playwright()will eventually tear down the driver process on exit, atry/finallymakes the intent explicit and guarantees prompt close on any error or timeout.♻️ Proposed structure
with sync_playwright() as p: browser = p.chromium.launch(headless=True) - context = browser.new_context( - ... - ) - ... - html = page.content() - browser.close() + try: + context = browser.new_context( + ... + ) + ... + html = page.content() + finally: + browser.close()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fast_flights/integrations/playwright.py` around lines 22 - 71, Wrap the Playwright browser/context/page lifecycle inside a try/finally so resources are always cleaned up: after launching with sync_playwright() and creating browser/context/page (symbols: sync_playwright, browser, context, page), move all navigation and response handling into the try block and call browser.close() (and optionally context.close() / page.close()) in the finally block to guarantee deterministic cleanup even if page.goto or page.content raises; keep the existing page.on("response", on_response) and response_objects handling inside the try.fast_flights/parser.py (1)
274-285: ⚡ Quick win
_wrb_payloadsyields trailing metadata items as if they were payloads.A
wrb.frframe's actual JSON payload is atframe[2]. The remaining elements (frame[3:]) are typicallynull,null,null,"generic"— not parseable payloads. Yielding them causesjson.loadsto fail in the caller (parser.py lines 88-94), which setsremembered_emptyto amalformed_wrb_payloaddiagnostic and overwrites a more informative status (e.g.empty_flight_payload). It also runs_contains_error_response/_parse_payloadagainst trailing strings like"generic".If you have observed a Google response that actually places a second payload past index 2, please leave a comment; otherwise restrict to
frame[2].♻️ Proposed change
def _wrb_payloads(frame: Any) -> Iterable[Any]: if not isinstance(frame, list): return if frame and frame[0] == "wrb.fr": if len(frame) > 2 and frame[2] is not None: yield frame[2] - for item in frame[3:]: - if item is not None: - yield item return for item in frame: yield from _wrb_payloads(item)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fast_flights/parser.py` around lines 274 - 285, _wrb_payloads is yielding trailing metadata (frame[3:]) as payloads for "wrb.fr" frames; change it to yield only the canonical JSON payload at frame[2] when present (i.e., keep the existing check "if len(frame) > 2 and frame[2] is not None: yield frame[2]" and remove the subsequent loop over frame[3:]), preserving the recursion branch for non-"wrb.fr" frames; if you have evidence that Google sometimes places a second payload after index 2, add a comment noting that edge-case before expanding the yield logic.tests/test_parser_xhr.py (1)
51-65: ⚡ Quick winHardcoded byte-count prefix
123(and269in_error_body) should be computed instead.The Google writeback framing format is
")]}'\n\n<byteLen>\n<json_frame>, where<byteLen>is the UTF-8 byte length of<json_frame>. Currently,_wrb_bodyand_wrb_frame_bodyhardcode123, and_error_bodyhardcodes269, regardless of actual frame content.Although
_google_json_framestreats the byte count as a marker only (not strict framing validation), the fixtures should still mirror real responses. Compute the actual byte length instead:♻️ Proposed fix
def _wrb_body(payload): frame = [["wrb.fr", None, json.dumps(payload)]] - return ")]}'\n\n123\n" + json.dumps(frame) + encoded = json.dumps(frame) + return f")]}'\n\n{len(encoded.encode())}\n" + encoded def _wrb_frame_body(frame): - return ")]}'\n\n123\n" + json.dumps(frame) + encoded = json.dumps(frame) + return f")]}'\n\n{len(encoded.encode())}\n" + encoded🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_parser_xhr.py` around lines 51 - 65, The fixtures currently hardcode the byte-count prefixes (e.g., "123" and "269") instead of computing the UTF-8 byte length of the JSON frame; update _wrb_body and _wrb_frame_body to compute byte_len = len(json.dumps(frame).encode("utf-8")) and interpolate that value into the prefix instead of "123", and update _error_body to build the error JSON string, compute its UTF-8 byte length similarly and use that value instead of "269" so the prefix equals the actual byte length of the JSON frame.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@fast_flights/integrations/playwright.py`:
- Line 19: The code concatenates raw string queries into GOOGLE_FLIGHTS_URL
(line creating url = q.url() if isinstance(q, Query) else GOOGLE_FLIGHTS_URL +
"?q=" + q) which breaks on spaces and special chars; fix by URL-encoding the
string branch: when q is a str, encode it with urllib.parse.quote_plus or
urllib.parse.quote (to handle spaces and non-ASCII) and build the URL as
GOOGLE_FLIGHTS_URL + "?q=" + encoded_q (or use urllib.parse.urlencode with {'q':
q}) so the Query branch (q.url()) remains unchanged and all special characters
are properly escaped.
---
Nitpick comments:
In `@fast_flights/integrations/playwright.py`:
- Around line 22-71: Wrap the Playwright browser/context/page lifecycle inside a
try/finally so resources are always cleaned up: after launching with
sync_playwright() and creating browser/context/page (symbols: sync_playwright,
browser, context, page), move all navigation and response handling into the try
block and call browser.close() (and optionally context.close() / page.close())
in the finally block to guarantee deterministic cleanup even if page.goto or
page.content raises; keep the existing page.on("response", on_response) and
response_objects handling inside the try.
In `@fast_flights/parser.py`:
- Around line 274-285: _wrb_payloads is yielding trailing metadata (frame[3:])
as payloads for "wrb.fr" frames; change it to yield only the canonical JSON
payload at frame[2] when present (i.e., keep the existing check "if len(frame) >
2 and frame[2] is not None: yield frame[2]" and remove the subsequent loop over
frame[3:]), preserving the recursion branch for non-"wrb.fr" frames; if you have
evidence that Google sometimes places a second payload after index 2, add a
comment noting that edge-case before expanding the yield logic.
In `@tests/test_parser_xhr.py`:
- Around line 51-65: The fixtures currently hardcode the byte-count prefixes
(e.g., "123" and "269") instead of computing the UTF-8 byte length of the JSON
frame; update _wrb_body and _wrb_frame_body to compute byte_len =
len(json.dumps(frame).encode("utf-8")) and interpolate that value into the
prefix instead of "123", and update _error_body to build the error JSON string,
compute its UTF-8 byte length similarly and use that value instead of "269" so
the prefix equals the actual byte length of the JSON frame.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 17c80431-d6f8-402b-8ed9-eb97944331dd
📒 Files selected for processing (9)
fast_flights/__init__.pyfast_flights/fetch_result.pyfast_flights/fetcher.pyfast_flights/integrations/__init__.pyfast_flights/integrations/base.pyfast_flights/integrations/playwright.pyfast_flights/model.pyfast_flights/parser.pytests/test_parser_xhr.py
| def fetch_html(self, q: Query | str, /) -> FetchResult: | ||
| from playwright.sync_api import sync_playwright | ||
|
|
||
| url = q.url() if isinstance(q, Query) else GOOGLE_FLIGHTS_URL + "?q=" + q |
There was a problem hiding this comment.
URL-encode the raw string query.
When q is a str, the value is concatenated directly into the URL. Any spaces, &, #, +, %, or non-ASCII characters will produce a malformed URL or be parsed by Google as additional query parameters. A natural-language query like "Flights from TPE to MYJ on 2025-12-22 one way economy class" (the same example used in the get_flights docstring) will be broken on the spaces and -.
🛡️ Proposed fix
+from urllib.parse import quote_plus
+
...
- url = q.url() if isinstance(q, Query) else GOOGLE_FLIGHTS_URL + "?q=" + q
+ url = q.url() if isinstance(q, Query) else f"{GOOGLE_FLIGHTS_URL}?q={quote_plus(q)}"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| url = q.url() if isinstance(q, Query) else GOOGLE_FLIGHTS_URL + "?q=" + q | |
| from urllib.parse import quote_plus | |
| ... | |
| url = q.url() if isinstance(q, Query) else f"{GOOGLE_FLIGHTS_URL}?q={quote_plus(q)}" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@fast_flights/integrations/playwright.py` at line 19, The code concatenates
raw string queries into GOOGLE_FLIGHTS_URL (line creating url = q.url() if
isinstance(q, Query) else GOOGLE_FLIGHTS_URL + "?q=" + q) which breaks on spaces
and special chars; fix by URL-encoding the string branch: when q is a str,
encode it with urllib.parse.quote_plus or urllib.parse.quote (to handle spaces
and non-ASCII) and build the URL as GOOGLE_FLIGHTS_URL + "?q=" + encoded_q (or
use urllib.parse.urlencode with {'q': q}) so the Query branch (q.url()) remains
unchanged and all special characters are properly escaped.
Google Flights does not always expose parseable itinerary data in the initial
script.ds:1HTML anymore. Browser-backed integrations can still see theshopping result payload in
FlightsFrontendService/GetShoppingResultsXHRresponses.
This adds support for integrations returning both the final HTML and captured
XHR bodies, then parses
wrb.frwriteback payloads before falling back to theexisting
script.ds:1parser.The existing HTML/string parser path is kept for backwards compatibility.
Main changes:
FetchResultcontainer forhtml,xhr_bodies, andurlwrb.frXHR payloads from Google Flights shopping resultsscript.ds:1parsing as fallbackmalformed/empty payloads
GetShoppingResultsresponses
Tests added for:
script.ds:1fixturesErrorResponseXHRswrb.frpayloadsFetchResultTested with:
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes