feat(api): cursor-based pagination for GET /flows/{flow_id}/runs (#322) #463

Open
GunaPalanivel wants to merge 2 commits into Netflix:master from GunaPalanivel:fix/issue-322-paginate-flow-runs

@GunaPalanivel GunaPalanivel commented Feb 20, 2026

Description

This PR adds cursor-based pagination to GET /flows/{flow_id}/runs using run_number as the cursor.

Response body shape is unchanged (still a flat array of runs).
When pagination is enabled, metadata is returned through headers:

  • Link with rel="next" (RFC 5988 style)
  • X-Total-Count
  • X-Pagination-Limit
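As a rough illustration of the header metadata listed above, the sketch below assembles the three headers for one page of results. The helper name `build_pagination_headers`, its arguments, and the example URL are hypothetical and not taken from the PR's code. The "emit Link only for a full page" rule follows the Design section of this description.

```python
from urllib.parse import urlencode


def build_pagination_headers(base_url, limit, total_count, page_run_numbers):
    """Return pagination headers for one page of runs (hypothetical helper)."""
    headers = {
        "X-Total-Count": str(total_count),
        "X-Pagination-Limit": str(limit),
    }
    # A full page means another page may exist, so advertise a next link
    # whose cursor is the last run_number on the current page.
    if len(page_run_numbers) == limit:
        query = urlencode({"_limit": limit, "_after": page_run_numbers[-1]})
        headers["Link"] = f'<{base_url}?{query}>; rel="next"'
    return headers


full_page = build_pagination_headers("/flows/MyFlow/runs", 2, 5, [5, 4])
last_page = build_pagination_headers("/flows/MyFlow/runs", 2, 5, [1])
```

In this sketch `full_page` carries a Link header pointing at `_after=4`, while `last_page` has only the two count headers because the page is shorter than the limit.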

Backward Compatibility

Backward compatibility is preserved by default:

  • If _limit is omitted, the endpoint keeps its legacy behavior and returns all runs without pagination headers.
  • If _limit is provided, it must be a positive integer, and pagination is applied.
  • _limit=0 is now treated as invalid (400) to avoid ambiguous semantics.

Also, _after is only valid when _limit is provided.

Problem

/flows/{flow_id}/runs currently returns all runs in one unbounded response.
For large flows this can lead to:

  • API Gateway payload-size failures
  • high latency
  • unnecessary memory pressure
  • no way to consume results incrementally

Design

Cursor Choice

run_number is monotonic and stable for pagination.

  • order is run_number DESC (newest first)
  • next cursor is the last run_number from the current page
  • Link is emitted only when current page is full (possible next page)
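The cursor rules above amount to a keyset query. The following is a minimal sketch using an in-memory SQLite table as a stand-in for the real database. The table name `runs`, its two-column schema, and the function `fetch_runs_page` are assumptions made for illustration and do not reflect the service's actual schema or the PR's `get_all_runs_paginated()` implementation.

```python
import sqlite3


def fetch_runs_page(conn, flow_id, limit, after=None):
    """Return up to `limit` run_numbers for a flow, newest first."""
    sql = "SELECT run_number FROM runs WHERE flow_id = ?"
    params = [flow_id]
    if after is not None:
        # Descending order: the next page holds run_numbers below the cursor.
        sql += " AND run_number < ?"
        params.append(after)
    sql += " ORDER BY run_number DESC LIMIT ?"
    params.append(limit)
    return [row[0] for row in conn.execute(sql, params)]


# Simplified stand-in schema (assumption), populated with five runs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (flow_id TEXT, run_number INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?)", [("MyFlow", n) for n in range(1, 6)]
)

first_page = fetch_runs_page(conn, "MyFlow", 2)
second_page = fetch_runs_page(conn, "MyFlow", 2, after=first_page[-1])
```

Because the cursor is a strict inequality on a unique, monotonic column, a run inserted between two requests cannot shift rows across page boundaries the way an OFFSET-based query could.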

Count Behavior and Trade-off

X-Total-Count is produced with a separate COUNT(*) query scoped by flow_id.

  • For current query shape, this is acceptable and index-backed.
  • For future tag-based JSONB filters, we will need to revisit strategy (for example GIN index and/or different count behavior when tag filters are active).

X-Total-Count is best-effort metadata, not a strict transactional guarantee under concurrent writes.

Validation and Errors

Returns 400 for invalid pagination inputs:

  • non-integer _limit
  • _limit <= 0
  • non-integer _after
  • _after <= 0
  • _after provided without _limit
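The validation rules above could be expressed roughly as follows. This is a sketch only: `parse_pagination_params`, its return shape, and the error messages are hypothetical, and the handler in the PR may structure the checks differently.

```python
def parse_pagination_params(query):
    """Validate _limit/_after from a mapping of query-string values.

    Returns (limit, after, error). A non-None `error` is a message the
    caller would send back with a 400 status.
    """
    raw_limit = query.get("_limit")
    raw_after = query.get("_after")

    if raw_limit is None:
        if raw_after is not None:
            return None, None, "_after requires _limit"
        # Legacy path: no pagination requested.
        return None, None, None

    try:
        limit = int(raw_limit)
    except ValueError:
        return None, None, "_limit must be an integer"
    if limit <= 0:
        return None, None, "_limit must be positive"

    after = None
    if raw_after is not None:
        try:
            after = int(raw_after)
        except ValueError:
            return None, None, "_after must be an integer"
        if after <= 0:
            return None, None, "_after must be positive"

    return limit, after, None
```

Returning an error value instead of raising keeps the sketch independent of any particular web framework; a real handler would translate the message into its own 400 response.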

Tests

Integration coverage includes:

  • omitted _limit -> legacy unbounded behavior, no pagination headers
  • _limit=2 -> multi-page traversal with Link next and no duplicates
  • ordering remains newest-first
  • invalid params return 400 (including _limit=0)
  • non-existent flow returns 200 with empty array
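For illustration, the multi-page traversal case can be mimicked from the client side as sketched below. `fake_get_runs` is an in-memory stand-in for the endpoint, not the PR's test code, and `next_cursor` and `collect_all` are hypothetical helper names.

```python
import re
from urllib.parse import parse_qs, urlparse


def next_cursor(link_header):
    """Extract the _after value from a Link header's rel="next" entry."""
    if not link_header:
        return None
    match = re.search(r'<([^>]+)>\s*;\s*rel="next"', link_header)
    if not match:
        return None
    values = parse_qs(urlparse(match.group(1)).query).get("_after")
    return int(values[0]) if values else None


def fake_get_runs(limit, after=None):
    """In-memory stand-in for the endpoint: returns (run_numbers, headers)."""
    all_runs = [5, 4, 3, 2, 1]  # newest first
    page = [n for n in all_runs if after is None or n < after][:limit]
    headers = {}
    if len(page) == limit:
        headers["Link"] = (
            f'</flows/MyFlow/runs?_limit={limit}&_after={page[-1]}>; rel="next"'
        )
    return page, headers


def collect_all(limit):
    """Follow rel="next" links until the server stops emitting them."""
    seen, after = [], None
    while True:
        page, headers = fake_get_runs(limit, after)
        seen.extend(page)
        after = next_cursor(headers.get("Link"))
        if after is None:
            return seen


runs = collect_all(2)
```

Note that when the total is an exact multiple of the limit, the last full page still carries a Link header, so the client makes one extra request that returns an empty page before stopping.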

Scope

This PR is intentionally scoped to runs endpoint pagination only.
Runs are the endpoint with unbounded growth and immediate payload-risk; other resources can be handled in follow-ups if needed.

AI Tool Usage

AI assistance was used during implementation.
All changes were reviewed manually, validated locally, and adjusted based on reviewer feedback.

Introduce RFC 5988 cursor-based pagination using run_number as the
cursor field. Responses include Link, X-Total-Count, and
X-Pagination-Limit headers. Legacy _limit=0 behavior is preserved
for backward compatibility.

- Add get_all_runs_paginated() with run_number DESC ordering
- Add count_records() for efficient X-Total-Count without full fetch
- Harden input validation: reject negative _limit and non-positive _after
- Validate column names in count_records against self.keys
- Add 5-case integration test covering multi-page iteration, legacy
  opt-out, invalid params, and empty flows
- Make init_db gracefully fall back to direct table check when goose
  binary is unavailable (supports local development without Docker)

Fixes Netflix#322
@GunaPalanivel
Author

@romain-intel could you trigger CI when you get a chance?

All integration tests pass locally (8 passed, 0 regressions).
Happy to address any review feedback.

)
return response

async def count_records(self, filter_dict=None) -> int:
Collaborator

Any performance concerns with the COUNT approach for providing pagination data? There is a plan to introduce filtering based on tags as well, which would require matching values in the JSONB tags column.

Author

@GunaPalanivel GunaPalanivel Mar 17, 2026

In this PR X-Total-Count uses a COUNT(*) scoped to flow_id, which is acceptable for the current query shape. For future tag-based JSONB filtering, I agree we should revisit with a dedicated strategy (likely GIN index plus possibly different total-count behavior under tag filters). I documented this trade-off in code and kept it out of scope for this PR.

Collaborator

Is it not necessary to add pagination to anything besides runs?

Author

@GunaPalanivel GunaPalanivel Mar 17, 2026

I scoped this PR to runs because runs are the unbounded-growth endpoint and the one currently at risk for oversized responses. Steps/tasks have different bounded patterns. I kept the implementation reusable so adding pagination to other endpoints in follow-up PRs is straightforward.

Comment thread services/metadata_service/api/run.py Outdated
after_run_number = None

# --- Legacy opt-out (_limit=0): unbounded query, no headers -------
if limit == 0:
Collaborator

How do we ensure backwards compatibility here? It seems the limit is never 0 unless the client explicitly sets it to that value.

Author

@GunaPalanivel GunaPalanivel Mar 17, 2026

You are right. _limit=0 is not a true backward-compat path unless explicitly sent. I updated behavior so compatibility is preserved when _limit is omitted (legacy unbounded path). I removed _limit=0 opt-out and now require positive _limit when pagination is requested; tests were updated accordingly.

@Aryan95614

Nice approach. Using run_number as the cursor key is actually cleaner for ordering stability than ts_epoch since it's a monotonic PK with no tie risk. I went with ts_epoch on the GSoC fork (saikonen/metaflow-service PR #9) because it's available on all six table types (flows, steps, tasks, artifacts, metadata all have ts_epoch but not all have a sequential PK equivalent). Tradeoff is needing a compound cursor for tiebreaking which I raised in Issue #11 on the GSoC fork. Curious if you're planning to extend beyond runs to the other endpoints?

Development

Successfully merging this pull request may close these issues.

500 error encountered on flows/<flow_id>/runs requests due to size of payload
