feat: Block Lighthouse cryptic user agent #1779
base: main
Conversation
PR Summary
Added two user agents to the blocked list to prevent data capture from Lighthouse testing tools using cryptic user agents, helping protect customers from potential billing impacts.
- Added 'moto g power (2022)' to src/utils/blocked-uas.ts, as it's a known Lighthouse testing user agent
- Added 'chrome-lighthouse' to src/utils/blocked-uas.ts for explicit Lighthouse agent blocking
- Documentation explains the rationale for blocking these agents: preventing the gaming of Lighthouse scores
1 file reviewed, no comments (Greptile)
Force-pushed from c77a385 to 9234a65
Size Change: +2.52 kB (+0.07%) Total Size: 3.56 MB
TIL, that's wild
src/utils/blocked-uas.ts (outdated)

```ts
// Believe it or not, these are all from Lighthouse
// https://github.com/GoogleChrome/lighthouse/blob/main/core/config/constants.js
'chrome-lighthouse',
'moto g power (2022)', // Mobile UA for Lighthouse: Mozilla/5.0 (Linux; Android 11; moto g power (2022)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Mobile Safari/537.36
```
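The entries above can be sketched as a standalone matcher. This is a hypothetical illustration, not the actual posthog-js implementation: it assumes each entry is treated as a lowercase substring of the incoming user agent, which is also why the casing of each entry matters. `isBlockedUA` and `BLOCKED_UA_SUBSTRINGS` are made-up names for the sketch.

```typescript
// Hypothetical sketch (not the actual posthog-js code): blocked entries are
// matched as lowercase substrings of the incoming user agent, so both
// 'chrome-lighthouse' and the emulated-device string are caught regardless
// of how the client cases its UA.
const BLOCKED_UA_SUBSTRINGS: string[] = [
    'chrome-lighthouse',
    'moto g power (2022)',
]

function isBlockedUA(ua: string, customBlocked: string[] = []): boolean {
    const lowered = ua.toLowerCase()
    return BLOCKED_UA_SUBSTRINGS.concat(customBlocked).some((blocked) =>
        lowered.includes(blocked.toLowerCase())
    )
}
```

Under this scheme, Lighthouse's full mobile UA matches because it contains the 'moto g power (2022)' device fragment, even though the string never mentions Lighthouse.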
put these in the tests file?
src/utils/blocked-uas.ts (outdated)

```ts
// https://github.com/GoogleChrome/lighthouse/blob/main/core/config/constants.js
'chrome-lighthouse',
'moto g power (2022)', // Mobile UA for Lighthouse: Mozilla/5.0 (Linux; Android 11; moto g power (2022)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Mobile Safari/537.36
'Intel Mac OS X 10_15_7', // Desktop UA for Lighthouse: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36
```
oh, this should be lower case
oh good catch, tests would have caught it, will add
Believe it or not, this is Lighthouse's user agent. They don't usually include "lighthouse" in it anymore because some people were gaming the system for LH scores, so they use some cryptic UAs now. I'll just ignore this outright because not many people will actually have that UA, and this might cause some big bills for some customers.
Force-pushed from 9234a65 to 68f8513
I did a bit more googling on this, and I'm now not sure we should do it. These are real user agents (afaik) and it's not impossible that we'd block real traffic. I'm wondering if we need to take another approach.

Where did it come up that these user agents were hitting a customer? I know that this one might be related: https://posthoghelp.zendesk.com/agent/tickets/25267, where ahrefs uses lighthouse in one of its crawlers. I wonder if we could just ask ahrefs to identify itself somehow (and not through IP ranges, ideally).
It is possible, yes, but I expect traffic from a specific user-agent to be negligible. That's why I updated the code to actually include the whole user agent rather than just a subset of it.
This is from https://posthoghelp.zendesk.com/agent/tickets/25646, and it's possible they've enabled ahrefs or something else, because it started on a specific day (February 3rd) and the number of accesses they get every day is pretty consistent. I feel confident about doing this if we go with the "include the whole user agent" approach, but I'm happy to just close this and let customers know about it if you disagree. They can always use …
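For reference, the "include the whole user agent" approach being discussed could look roughly like this. This is a hedged sketch, not the PR's actual diff: the two UA strings are the emulated ones from the Lighthouse constants file linked earlier, and `isLighthouseFullUA` is a made-up name.

```typescript
// Sketch of the stricter "whole user agent" variant (hypothetical, not the
// merged code): only byte-for-byte matches of Lighthouse's full emulated UA
// strings are dropped, so a real moto g power (2022) owner running a newer
// Chrome build is not affected.
const LIGHTHOUSE_FULL_UAS: string[] = [
    'Mozilla/5.0 (Linux; Android 11; moto g power (2022)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Mobile Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
]

function isLighthouseFullUA(ua: string): boolean {
    return LIGHTHOUSE_FULL_UAS.includes(ua)
}
```

The trade-off is maintenance: exact matching silently stops working whenever Lighthouse bumps the Chrome version embedded in its UA strings, so the list would need to track upstream.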
If we wanted, we could also include a flag to block these user agents too, I'm just not sure it should be the default. IMO it's pretty bad that a crawler hasn't identified itself. I'll send this to [email protected], any thoughts:

> Hi there,
>
> I work for PostHog, and one of our mutual customers has asked for help blocking the ahrefs crawler from their analytics data. You use lighthouse as part of your crawling, which by default sets the user agent to a realistic user agent (see https://github.com/GoogleChrome/lighthouse/blob/main/core/config/constants.js#L42) which we aren't able to block. We've seen these user agents coming from your IP addresses (e.g. 5.39.1.230).
>
> While we could block events coming from your IP addresses, it'd be very hard for us to do this client-side, where the IP address is not known, which would make it an imperfect solution for some of our products. Could you change the user agent, or at least provide your customers an option to change the user agent when crawling their own site?
>
> Thanks,
Yeah, not sure. They already have the power to add specific user-agents they wanna block, so I don't see the value in having a …
100%, not very friendly
Can you CC me? I like the email. The support ticket I'm working on specifically isn't coming from Ahrefs, though, but it might be something similar; I've let them know.
Making this a draft while we validate our solution with ahrefs
Thanks for helping with this, guys, @rafaeelaudibert and @robbie-c. While you're thinking about it, I opened a ticket with Arcjet (https://app.arcjet.com/) and they will share their thoughts on it with me, because maybe they've run into the same issue (I don't know whether it will impact your product or not, but it will be good to understand how other SaaS companies have handled this).
Thanks for flagging this to us @LuizHAP. All the known bot user agents we track are open source at https://github.com/arcjet/well-known-bots. In the case where the user agent is one of those spoofed user agents, Arcjet will do an IP lookup against our reputation database. If it's from a known crawler like Ahrefs then we'll return a deny decision.
Hey, @davidmytton, thank you for sharing that! We're considering adding an IP lookup as a last resort. Ideally, we'd get the "bad" actors to improve their behavior but we both know that's not that easy. Besides the Lighthouse UAs, do you have something else on the list of known spoofed UAs? |
The UA matching is the first step. If there's no match then we use our private IP reputation database. We also do verification for known providers, e.g. we verify whether a request claiming to be Google is actually Google through their rDNS lookup & IP ranges. We do this for several providers who offer those lookup options so clients can't pretend to be good bots!
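A minimal sketch of the hostname half of that rDNS verification, assuming Google's documented crawler domains (googlebot.com, google.com). The reverse-DNS lookup itself and the forward-confirmation step (resolving the hostname back and checking it maps to the original IP) are omitted; `isVerifiedGooglebotHostname` is an illustrative name, not any vendor's API.

```typescript
// Hypothetical sketch: after reverse DNS turns a claimed Googlebot IP into a
// hostname, trust it only if the hostname sits under a Google-owned domain.
// A production verifier must also resolve that hostname forward and confirm
// it maps back to the same IP (forward-confirmed reverse DNS), otherwise an
// attacker controlling their own rDNS could still spoof the name.
const GOOGLEBOT_DOMAINS = ['.googlebot.com', '.google.com']

function isVerifiedGooglebotHostname(rdnsHostname: string): boolean {
    const host = rdnsHostname.toLowerCase().replace(/\.$/, '') // drop trailing dot
    return GOOGLEBOT_DOMAINS.some((suffix) => host.endsWith(suffix))
}
```

Note the leading dot in each suffix: it prevents lookalike hostnames such as `evil-googlebot.com.attacker.net` from passing the check.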
This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the …
@benjackwhite the arcjet tool above is interesting... too heavy to include in the SDK but could potentially offer a transformation that uses it to augment during ingestion (dropping this into your brain, but obvs feel free to ignore 🤣)