Determine how to handle various User Agent situations #44
Description
In linehaul, there are 3 states that any particular event can be in:
1. The user agent is parseable for data.
2. The user agent is unknown.
3. The user agent is known, but it's not parseable for data.
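To make the three states concrete, here is a minimal sketch of how an event's user agent could be classified. Everything in it is an assumption for illustration: `UAState`, `classify`, and both pattern lists are hypothetical names, and the patterns are not linehaul's actual parser or ignore list.

```python
import enum
import re


class UAState(enum.Enum):
    PARSED = "parsed"    # the user agent is parseable for data
    UNKNOWN = "unknown"  # the user agent is not recognized at all
    IGNORED = "ignored"  # the user agent is known, but carries no usable data


# Hypothetical patterns, for illustration only.
PARSEABLE_PATTERNS = [re.compile(r"^pip/(?P<version>\S+)")]
IGNORED_PATTERNS = [re.compile(r"^ExampleBot/")]


def classify(user_agent):
    """Return (state, data) for a raw user agent string."""
    for pattern in PARSEABLE_PATTERNS:
        match = pattern.match(user_agent)
        if match:
            return UAState.PARSED, match.groupdict()
    for pattern in IGNORED_PATTERNS:
        if pattern.match(user_agent):
            return UAState.IGNORED, None
    # Anything that matched neither list is an unknown user agent.
    return UAState.UNKNOWN, None
```

With this sketch, `classify("pip/18.0")` falls into the first state, `classify("ExampleBot/1.0")` into the third, and any other string into the second.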
For (1), the correct outcome is obvious: we have data, so we want to save it in BigQuery.
For (2), we currently record a download, but with all of the data that typically comes from the user agent missing. Thus the BigQuery table more accurately reflects all of the downloads, but projects querying the data need to be more careful about how they query it. For example, it's easy to compute something like `py3_downloads / total_downloads`, but that gives a misleadingly small percentage, since downloads with an unknown user agent land in the denominator and are effectively counted as py2. Prior to Linehaul v3, the behavior was to throw away this event and not log anything for it.
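The pitfall with unknown user agents can be shown with a small sketch. The rows below are made-up sample data, and the function names are hypothetical; `None` stands in for a download that was recorded without any user agent data.

```python
# Made-up sample: Python versions per download, None = unknown user agent.
downloads = ["3.7", "3.6", "2.7", None, None, "3.8"]


def py3_share_naive(rows):
    """py3_downloads / total_downloads, with unknowns left in the denominator."""
    py3 = sum(1 for v in rows if v is not None and v.startswith("3"))
    return py3 / len(rows)


def py3_share_known_only(rows):
    """Same ratio, restricted to rows where the Python version is known."""
    known = [v for v in rows if v is not None]
    py3 = sum(1 for v in known if v.startswith("3"))
    return py3 / len(known)


print(py3_share_naive(downloads))       # 0.5
print(py3_share_known_only(downloads))  # 0.75
```

The naive ratio treats the two unknown rows as if they were py2 downloads, which is the extra care that anyone querying the table would need to take.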
For (3), Linehaul v3 and earlier throw away the event (we implement this as "ignored" user agents). The list of these can be found at:
So ultimately the question is: do these behaviors make sense? That boils down to whether we want BigQuery to reflect every download as accurately as possible, or whether we want to filter it down to data that is more usable for specific questions (but, of course, less usable for other questions that are likely to be more of an edge case).
Thoughts?