
feat: Block Lighthouse cryptic user agent #1779

Draft · wants to merge 1 commit into main

Conversation

@rafaeelaudibert (Member) commented Mar 3, 2025

Believe it or not, this is Lighthouse's user agent. They don't usually include "lighthouse" in it anymore because some people were gaming the system for LH scores, so they use some cryptic UAs now. I'll just ignore this outright because not many people will actually have that UA, and this might cause some big bills for some customers.

https://github.com/GoogleChrome/lighthouse/blob/main/core/config/constants.js#L42
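A minimal sketch of how this kind of blocklist check typically works (the helper name and list shape here are illustrative, not the exact posthog-js internals): entries are lowercase substrings matched against the lowercased user agent, which is why matching on the whole cryptic UA string is viable.

```typescript
// Illustrative substring blocklist check; names are assumptions, not the
// exact posthog-js internals.
const BLOCKED_UA_SUBSTRINGS: string[] = [
    'chrome-lighthouse',
    // Cryptic mobile UA Lighthouse uses so sites can't special-case it:
    'moto g power (2022)',
]

function isBlockedUA(ua: string, customBlocklist: string[] = []): boolean {
    const lowered = ua.toLowerCase()
    return [...BLOCKED_UA_SUBSTRINGS, ...customBlocklist].some((entry) =>
        lowered.includes(entry.toLowerCase())
    )
}
```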

@rafaeelaudibert rafaeelaudibert requested review from robbie-c and a team March 3, 2025 21:56
vercel bot commented Mar 3, 2025

The latest updates on your projects.

| Name | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| posthog-js | ✅ Ready | Visit Preview | Mar 4, 2025 1:46pm |

@greptile-apps (bot, Contributor) left a comment

PR Summary

Added two user agents to the blocked list to prevent data capture from Lighthouse testing tools using cryptic user agents, helping protect customers from potential billing impacts.

  • Added 'moto g power (2022)' to src/utils/blocked-uas.ts as it's a known Lighthouse testing user agent
  • Added 'chrome-lighthouse' to src/utils/blocked-uas.ts for explicit Lighthouse agent blocking
  • Documentation explains rationale for blocking these agents to prevent gaming of Lighthouse scores

1 file reviewed, no comments.

@rafaeelaudibert rafaeelaudibert force-pushed the block-moto-g-4-lighthouse-agent branch from c77a385 to 9234a65 Compare March 3, 2025 22:00
@rafaeelaudibert rafaeelaudibert added the bump patch Bump patch version when this PR gets merged label Mar 3, 2025

github-actions bot commented Mar 3, 2025

Size Change: +2.52 kB (+0.07%)

Total Size: 3.56 MB

| Filename | Size | Change |
| --- | --- | --- |
| dist/array.full.es5.js | 273 kB | +252 B (+0.09%) |
| dist/array.full.js | 376 kB | +252 B (+0.07%) |
| dist/array.full.no-external.js | 375 kB | +252 B (+0.07%) |
| dist/array.js | 185 kB | +252 B (+0.14%) |
| dist/array.no-external.js | 183 kB | +252 B (+0.14%) |
| dist/main.js | 185 kB | +252 B (+0.14%) |
| dist/module.full.js | 376 kB | +252 B (+0.07%) |
| dist/module.full.no-external.js | 375 kB | +252 B (+0.07%) |
| dist/module.js | 185 kB | +252 B (+0.14%) |
| dist/module.no-external.js | 183 kB | +252 B (+0.14%) |

Unchanged:

| Filename | Size |
| --- | --- |
| dist/all-external-dependencies.js | 219 kB |
| dist/customizations.full.js | 14 kB |
| dist/dead-clicks-autocapture.js | 14.5 kB |
| dist/exception-autocapture.js | 9.51 kB |
| dist/external-scripts-loader.js | 2.64 kB |
| dist/posthog-recorder.js | 212 kB |
| dist/recorder-v2.js | 115 kB |
| dist/recorder.js | 115 kB |
| dist/surveys-preview.js | 71.3 kB |
| dist/surveys.js | 76 kB |
| dist/tracing-headers.js | 1.76 kB |
| dist/web-vitals.js | 10.4 kB |

@robbie-c (Member) left a comment

TIL, that's wild

```ts
// Believe it or not, these are all from Lighthouse
// https://github.com/GoogleChrome/lighthouse/blob/main/core/config/constants.js
'chrome-lighthouse',
'moto g power (2022)', // Mobile UA for Lighthouse: Mozilla/5.0 (Linux; Android 11; moto g power (2022)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Mobile Safari/537.36
```
A Member commented on the diff:

put these in the tests file?

```ts
// https://github.com/GoogleChrome/lighthouse/blob/main/core/config/constants.js
'chrome-lighthouse',
'moto g power (2022)', // Mobile UA for Lighthouse: Mozilla/5.0 (Linux; Android 11; moto g power (2022)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Mobile Safari/537.36
'Intel Mac OS X 10_15_7', // Desktop UA for Lighthouse: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36
```
A Member replied:

oh, this should be lower case

@rafaeelaudibert (Member, Author) replied:

oh good catch, tests would have caught it, will add
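A sketch of the kind of test that would have caught the casing issue (helper names are assumptions): the UA is lowercased before substring matching, so a mixed-case entry like 'Intel Mac OS X 10_15_7' would silently never match.

```typescript
// Assumed matching behaviour: the UA is lowercased before substring checks,
// so every blocklist entry must itself be lowercase.
const blockedEntries = [
    'chrome-lighthouse',
    'moto g power (2022)',
    'intel mac os x 10_15_7', // must be lowercase, or it never matches
]

function isBlocked(ua: string): boolean {
    const lowered = ua.toLowerCase()
    return blockedEntries.some((entry) => lowered.includes(entry))
}

const desktopLighthouseUA =
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
```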

@rafaeelaudibert rafaeelaudibert force-pushed the block-moto-g-4-lighthouse-agent branch from 9234a65 to 68f8513 Compare March 4, 2025 13:42
@rafaeelaudibert rafaeelaudibert requested a review from robbie-c March 4, 2025 13:42
@robbie-c (Member) commented Mar 4, 2025

I did a bit more googling on this - I'm now not sure we should do it.

These are real user agents (afaik) and it's not impossible that we'd block real traffic.

I'm wondering if we need to take another approach. Where did it come up that these user agents were hitting a customer? I know that this one might be related https://posthoghelp.zendesk.com/agent/tickets/25267, where ahrefs uses lighthouse in one of its crawlers. I wonder if we could just ask ahrefs to identify itself somehow (and not through ip ranges, ideally)

@rafaeelaudibert (Member, Author) commented Mar 4, 2025

> I did a bit more googling on this - I'm now not sure we should do it.
>
> These are real user agents (afaik) and it's not impossible that we'd block real traffic.

It is possible, yes, but I expect traffic from a specific user agent to be negligible. That's why I updated the code to include the whole user agent rather than just a subset of it.

> I'm wondering if we need to take another approach. Where did it come up that these user agents were hitting a customer? I know that this one might be related https://posthoghelp.zendesk.com/agent/tickets/25267, where ahrefs uses lighthouse in one of its crawlers. I wonder if we could just ask ahrefs to identify itself somehow (and not through ip ranges, ideally)

This is from https://posthoghelp.zendesk.com/agent/tickets/25646, and it's possible they've enabled ahrefs or something else, because it started on a specific day (February 3rd) and the number of accesses per day has been pretty consistent since.

I feel confident doing this if we go with the "include the whole user agent" approach, but I'm happy to just close this and let customers know about it if you disagree. They can always use custom_uagent_blocklist to block these themselves - which I already told the customer to do.
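For reference, the customer-side workaround looks roughly like this (the `custom_uagent_blocklist` option name comes from the thread; the rest of the init shape is a sketch, not verified against the posthog-js docs):

```typescript
import posthog from 'posthog-js'

// Sketch: blocking the cryptic Lighthouse UAs via the customer-facing
// custom_uagent_blocklist option mentioned in the thread.
posthog.init('<project-api-key>', {
    custom_uagent_blocklist: ['chrome-lighthouse', 'moto g power (2022)'],
})
```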

@robbie-c (Member) commented Mar 4, 2025

If we wanted, we could also include a flag to block these user agents too, I'm just not sure it should be the default.

IMO it's pretty bad that a crawler hasn't identified itself.

I'll send this to [email protected], any thoughts:

Hi there,

I work for PostHog, and one of our mutual customers has asked for help blocking the ahrefs crawler from their analytics data.

You use lighthouse as part of your crawling, which by default sets the user agent to a realistic user agent (see https://github.com/GoogleChrome/lighthouse/blob/main/core/config/constants.js#L42) which we aren't able to block. We've seen these user agents coming from your IP addresses (e.g. 5.39.1.230).

While we could block events coming from your IP addresses, it'd be very hard for us to do this client-side, where the IP address is not known, which would be an imperfect solution for some of our products.

Could you change the user agent, or at least provide your customers an option to change the user agent when crawling their own site?

Thanks,
Robbie

@rafaeelaudibert (Member, Author) commented:

> If we wanted, we could also include a flag to block these user agents too, I'm just not sure it should be the default.

Yeah, not sure. They already have the power to add the specific user agents they want to block, so I don't see the value in a block_lighthouse flag. Maybe just improved documentation on the matter - listing the offending user agents - would be enough.

> IMO it's pretty bad that a crawler hasn't identified itself.

100%, not very friendly.

> I'll send this to [email protected], any thoughts:
>
> [email draft quoted above]

Can you CC me? I like the email.

The specific support ticket I'm working on isn't coming from Ahrefs though, but it might be something similar; I've let them know.

177.244.11.68

@rafaeelaudibert (Member, Author) commented:
Making this a draft while we validate our solution with ahrefs

@rafaeelaudibert rafaeelaudibert marked this pull request as draft March 4, 2025 16:11
@robbie-c robbie-c removed their request for review March 10, 2025 10:12
@LuizHAP commented Mar 10, 2025

Thanks for helping with this, @rafaeelaudibert and @robbie-c. Meanwhile, while you think about it, I opened a ticket with Arcjet (https://app.arcjet.com/) and they will share their thoughts on it, because they may have dealt with the same issue (I don't know whether it will impact your product, but it would be good to understand how other SaaS companies have handled this).

@davidmytton commented:
Thanks for flagging this to us @LuizHAP.

All the known bot user agents we track are open source at https://github.com/arcjet/well-known-bots. In the case where the user agent is one of those spoofed user agents, Arcjet will do an IP lookup against our reputation database. If it's from a known crawler like Ahrefs then we'll return a deny decision.

@rafaeelaudibert (Member, Author) commented:

> [Arcjet reply quoted above]

Hey, @davidmytton, thank you for sharing that!

We're considering adding an IP lookup as a last resort. Ideally, we'd get the "bad" actors to improve their behavior but we both know that's not that easy. Besides the Lighthouse UAs, do you have something else on the list of known spoofed UAs?

@davidmytton replied:

> [question quoted above]

The UA matching is the first step. If there's no match then we use our private IP reputation database. We also do verification for known providers e.g. we verify whether a request claiming to be Google is actually Google through their rDNS lookup & IP ranges. We do this for several providers who offer those lookup options so clients can't pretend to be good bots!
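The forward-confirmed reverse DNS check described above can be sketched like this, with Googlebot as the example (the hostname suffixes follow Google's published crawler-verification guidance; the function shape is an assumption, not Arcjet's implementation):

```typescript
import { reverse, lookup } from 'node:dns/promises'

// Hostname suffixes Google documents for its crawlers.
const GOOGLE_RDNS_SUFFIXES = ['.googlebot.com', '.google.com', '.googleusercontent.com']

function hostnameMatchesGoogle(hostname: string): boolean {
    return GOOGLE_RDNS_SUFFIXES.some((suffix) => hostname.toLowerCase().endsWith(suffix))
}

// Forward-confirmed rDNS: reverse-resolve the IP, check the PTR hostname's
// suffix, then forward-resolve that hostname and confirm it maps back to the
// original IP, so a spoofed PTR record alone isn't enough.
async function verifyGooglebot(ip: string): Promise<boolean> {
    const hostnames = await reverse(ip)
    const candidate = hostnames.find(hostnameMatchesGoogle)
    if (!candidate) return false
    const { address } = await lookup(candidate)
    return address === ip
}
```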

@posthog-bot (Collaborator) commented:

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week.

@pauldambra (Member) commented:

@benjackwhite the arcjet tool above is interesting... too heavy to include in the SDK but could potentially offer a transformation that uses it to augment during ingestion

(dropping this into your brain, but obvs feel free to ignore 🤣)

@posthog-bot posthog-bot removed the stale label Mar 19, 2025