Description
The Logfire SDK has a query API (see https://logfire.pydantic.dev/docs/how-to-guides/query-api/) that allows downloading data, but a single query can return at most 10K rows (the limit= argument accepts a maximum value of 10000).
At the same time, the logfire client doesn't have built-in support for pagination (and implementing e.g. cursor-based pagination correctly isn't trivial).
To allow downloads beyond the 10K-row limit (which can easily be exceeded when, e.g., downloading historical data for further analysis), built-in support for pagination and/or corresponding documentation would be useful.
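For reference, here's roughly what a single download looks like with the documented client. This is only a sketch: the import path, the context-manager usage, and the assumption that query_json_rows returns a dict with a "rows" list reflect my reading of the docs rather than anything verified.

```python
# Minimal sketch of a single download via the query API (not verified).
# Assumed: LogfireQueryClient lives at logfire.query_client, query_json_rows
# accepts sql= and limit=, and returns a dict containing a "rows" list.
from logfire.query_client import LogfireQueryClient

SQL = """
SELECT start_timestamp, trace_id, span_id, message
FROM records
ORDER BY start_timestamp, trace_id, span_id
"""

with LogfireQueryClient(read_token="YOUR_READ_TOKEN") as client:
    result = client.query_json_rows(sql=SQL, limit=10_000)  # 10000 is the maximum
    rows = result["rows"]  # anything beyond the first 10K rows is simply not returned
```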
Some insights I got from a discussion with the Logfire team (@dmontagu) on this topic:
My question: would using ORDER BY start_timestamp, trace_id, span_id in my queries, in combination with WHERE (start_timestamp, trace_id, span_id) > (last_record_timestamp, last_record_trace_id, last_record_span_id), work for reliable pagination?
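Concretely, the query shape I have in mind is sketched below; the records table and column names are how I understand the schema, and whether the SQL engine accepts row-value (tuple) comparisons is an assumption on my part (the equivalent OR expansion is noted in a comment).

```python
# Sketch of the keyset-pagination query shape I'm asking about (placeholder
# cursor values; depending on the engine, the timestamp literal may need an
# explicit cast).
def page_query(last_ts: str, last_trace_id: str, last_span_id: str,
               page_size: int = 10_000) -> str:
    return f"""
    SELECT start_timestamp, trace_id, span_id, message
    FROM records
    WHERE (start_timestamp, trace_id, span_id)
        > ('{last_ts}', '{last_trace_id}', '{last_span_id}')
    -- If tuple comparisons aren't supported, the condition expands to:
    --   start_timestamp > ts
    --   OR (start_timestamp = ts AND trace_id > tid)
    --   OR (start_timestamp = ts AND trace_id = tid AND span_id > sid)
    ORDER BY start_timestamp, trace_id, span_id
    LIMIT {page_size}
    """
```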
Their answer: this approach is correct as long as new rows with older start_timestamp values aren't inserted after you have already paginated past them. In particular, if you are paginating through data from one to two hours ago, and a new row gets inserted now with a start_timestamp that is 90 minutes old, that row will obviously be missed.
The solution to that is either to only paginate through times that are old enough that you know no rows will be missed (e.g., if you are confident you aren't sending data that is delayed by more than ~5 minutes, you can paginate through start_timestamps that are at least 5 minutes old), or to use the created_at column for pagination, which is the timestamp at which the row was added to the Logfire database. (If you use created_at, you'll probably still want to incorporate a small delay of at least ~10 seconds to account for variable latency between when the payload is received and when it becomes queryable, though in principle this discrepancy should be very small. Using a larger margin like >= 1 minute is probably safest, but I'm not sure whether that amount of latency would be problematic for your application.)
To be clear, if you want to use created_at, it simply replaces start_timestamp in the cursor; you'd still want to include trace_id and span_id as well. I'd also include kind to be safe if there's any chance that you are paginating over a time range containing both open and closed spans (this is mostly an issue if you are deliberately paginating over time ranges where spans may start out open and get closed during the course of your pagination).
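Put together, the safer cursor described here would look roughly like the sketch below; the availability of created_at and kind as queryable columns, the literal formats, and the ~1 minute safety margin are assumptions on my part.

```python
from datetime import datetime, timedelta, timezone

# Sketch of a created_at-based cursor query with a safety margin.
# Assumed: created_at and kind are queryable columns of the records table, and
# a ~1 minute margin is an acceptable amount of latency for the application.
SAFETY_MARGIN = timedelta(minutes=1)

def safe_page_query(last_created_at: str, last_trace_id: str,
                    last_span_id: str, last_kind: str,
                    page_size: int = 10_000) -> str:
    # The cutoff could also be fixed once at the start of the whole run.
    cutoff = (datetime.now(timezone.utc) - SAFETY_MARGIN).isoformat()
    return f"""
    SELECT created_at, trace_id, span_id, kind, start_timestamp, message
    FROM records
    WHERE created_at <= '{cutoff}'
      AND (created_at, trace_id, span_id, kind)
        > ('{last_created_at}', '{last_trace_id}', '{last_span_id}', '{last_kind}')
    ORDER BY created_at, trace_id, span_id, kind
    LIMIT {page_size}
    """
```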
So basically, here are my suggested guidelines:
If you are paginating over a time range of data where you are confident there won't be new data received that is within that time range while you are paginating, you can safely use (start_timestamp, trace_id, span_id) as a cursor.
If you are paginating over a time range of data where you expect there may be new data received that would fall into the relevant time range, it may be safer to paginate over (created_at, trace_id, span_id, kind).
I think the main reason not to always use this approach is that it makes the code harder to understand: it relies on a Logfire-internal timestamp for pagination and has to account for pending vs. non-pending spans. And it's just unnecessary if you are retrieving data from a time range where the data will definitely have already been received.
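Putting all of this together, the kind of download loop I'd imagine is sketched below. This is only a sketch: the client and method names follow the query-API docs as I understand them, the response shape (a dict with a "rows" list of dicts keyed by column name) is an assumption, and safe_page_query refers to the hypothetical query builder sketched above.

```python
# Sketch of a paginated download using the created_at-based cursor.
# Assumed: query_json_rows returns a dict with a "rows" list of dicts;
# safe_page_query is the hypothetical query builder sketched earlier.
from logfire.query_client import LogfireQueryClient

PAGE_SIZE = 10_000  # current per-query maximum

def download_all(read_token: str) -> list[dict]:
    all_rows: list[dict] = []
    # Start the cursor before any real data; the exact starting literals are
    # placeholders and depend on the formats the engine accepts.
    cursor = ("1970-01-01T00:00:00+00:00", "", "", "")
    with LogfireQueryClient(read_token=read_token) as client:
        while True:
            sql = safe_page_query(*cursor, page_size=PAGE_SIZE)
            rows = client.query_json_rows(sql=sql)["rows"]
            all_rows.extend(rows)
            if len(rows) < PAGE_SIZE:
                break  # last (possibly partial) page reached
            last = rows[-1]
            cursor = (last["created_at"], last["trace_id"],
                      last["span_id"], last["kind"])
    return all_rows
```

If built-in pagination were added to the client, I'd imagine it being a thin wrapper around a loop like this one.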