
feat: Block Lighthouse cryptic user agent #1779

Draft · wants to merge 1 commit into main

Conversation

@rafaeelaudibert (Member) commented Mar 3, 2025

Believe it or not, this is Lighthouse's user agent. They don't usually include "lighthouse" in it anymore because some people were gaming the system for LH scores, so they use some cryptic UAs now. I'll just ignore this outright because not many people will actually have that UA, and this might cause some big bills for some customers.

https://github.com/GoogleChrome/lighthouse/blob/main/core/config/constants.js#L42
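A minimal sketch of how this kind of blocklist check typically works (the helper name and list shape here are illustrative, not the exact posthog-js internals): entries are lowercase substrings matched against the lowercased user agent, which is why matching on the whole cryptic UA string is viable.

```typescript
// Illustrative substring blocklist check; names are assumptions, not the
// exact posthog-js internals.
const BLOCKED_UA_SUBSTRINGS: string[] = [
    'chrome-lighthouse',
    // Cryptic mobile UA Lighthouse uses so sites can't special-case it:
    'moto g power (2022)',
]

function isBlockedUA(ua: string, customBlocklist: string[] = []): boolean {
    const lowered = ua.toLowerCase()
    return [...BLOCKED_UA_SUBSTRINGS, ...customBlocklist].some((entry) =>
        lowered.includes(entry.toLowerCase())
    )
}
```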

@rafaeelaudibert rafaeelaudibert requested review from robbie-c and a team March 3, 2025 21:56
vercel bot commented Mar 3, 2025

The latest updates on your projects.

| Name | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| posthog-js | ✅ Ready | Visit Preview | Mar 4, 2025 1:46pm |

@greptile-apps (bot, Contributor) left a comment

PR Summary

Added two user agents to the blocked list to prevent data capture from Lighthouse testing tools using cryptic user agents, helping protect customers from potential billing impacts.

  • Added 'moto g power (2022)' to src/utils/blocked-uas.ts as it's a known Lighthouse testing user agent
  • Added 'chrome-lighthouse' to src/utils/blocked-uas.ts for explicit Lighthouse agent blocking
  • Documentation explains rationale for blocking these agents to prevent gaming of Lighthouse scores

1 file reviewed, no comments.

@rafaeelaudibert rafaeelaudibert force-pushed the block-moto-g-4-lighthouse-agent branch from c77a385 to 9234a65 Compare March 3, 2025 22:00
@rafaeelaudibert rafaeelaudibert added the bump patch Bump patch version when this PR gets merged label Mar 3, 2025

github-actions bot commented Mar 3, 2025

Size Change: +2.52 kB (+0.07%)

Total Size: 3.56 MB

| Filename | Size | Change |
| --- | --- | --- |
| dist/array.full.es5.js | 273 kB | +252 B (+0.09%) |
| dist/array.full.js | 376 kB | +252 B (+0.07%) |
| dist/array.full.no-external.js | 375 kB | +252 B (+0.07%) |
| dist/array.js | 185 kB | +252 B (+0.14%) |
| dist/array.no-external.js | 183 kB | +252 B (+0.14%) |
| dist/main.js | 185 kB | +252 B (+0.14%) |
| dist/module.full.js | 376 kB | +252 B (+0.07%) |
| dist/module.full.no-external.js | 375 kB | +252 B (+0.07%) |
| dist/module.js | 185 kB | +252 B (+0.14%) |
| dist/module.no-external.js | 183 kB | +252 B (+0.14%) |

Unchanged:

| Filename | Size |
| --- | --- |
| dist/all-external-dependencies.js | 219 kB |
| dist/customizations.full.js | 14 kB |
| dist/dead-clicks-autocapture.js | 14.5 kB |
| dist/exception-autocapture.js | 9.51 kB |
| dist/external-scripts-loader.js | 2.64 kB |
| dist/posthog-recorder.js | 212 kB |
| dist/recorder-v2.js | 115 kB |
| dist/recorder.js | 115 kB |
| dist/surveys-preview.js | 71.3 kB |
| dist/surveys.js | 76 kB |
| dist/tracing-headers.js | 1.76 kB |
| dist/web-vitals.js | 10.4 kB |

@robbie-c (Member) left a comment

TIL, that's wild

```ts
// Believe it or not, these are all from Lighthouse
// https://github.com/GoogleChrome/lighthouse/blob/main/core/config/constants.js
'chrome-lighthouse',
'moto g power (2022)', // Mobile UA for Lighthouse: Mozilla/5.0 (Linux; Android 11; moto g power (2022)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Mobile Safari/537.36
```
A Member commented on the diff:

put these in the tests file?

```ts
// https://github.com/GoogleChrome/lighthouse/blob/main/core/config/constants.js
'chrome-lighthouse',
'moto g power (2022)', // Mobile UA for Lighthouse: Mozilla/5.0 (Linux; Android 11; moto g power (2022)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Mobile Safari/537.36
'Intel Mac OS X 10_15_7', // Desktop UA for Lighthouse: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36
```
A Member replied:

oh, this should be lower case

@rafaeelaudibert (Member, Author) replied:

oh good catch, tests would have caught it, will add
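A sketch of the kind of test that would have caught the casing issue (helper names are assumptions): the UA is lowercased before substring matching, so a mixed-case entry like 'Intel Mac OS X 10_15_7' would silently never match.

```typescript
// Assumed matching behaviour: the UA is lowercased before substring checks,
// so every blocklist entry must itself be lowercase.
const blockedEntries = [
    'chrome-lighthouse',
    'moto g power (2022)',
    'intel mac os x 10_15_7', // must be lowercase, or it never matches
]

function isBlocked(ua: string): boolean {
    const lowered = ua.toLowerCase()
    return blockedEntries.some((entry) => lowered.includes(entry))
}

const desktopLighthouseUA =
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
```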

@rafaeelaudibert rafaeelaudibert force-pushed the block-moto-g-4-lighthouse-agent branch from 9234a65 to 68f8513 Compare March 4, 2025 13:42
@rafaeelaudibert rafaeelaudibert requested a review from robbie-c March 4, 2025 13:42
@robbie-c (Member) commented Mar 4, 2025

I did a bit more googling on this - I'm now not sure we should do it.

These are real user agents (afaik) and it's not impossible that we'd block real traffic.

I'm wondering if we need to take another approach. Where did it come up that these user agents were hitting a customer? I know that this one might be related https://posthoghelp.zendesk.com/agent/tickets/25267, where ahrefs uses lighthouse in one of its crawlers. I wonder if we could just ask ahrefs to identify itself somehow (and not through ip ranges, ideally)

@rafaeelaudibert (Member, Author) commented Mar 4, 2025

> I did a bit more googling on this - I'm now not sure we should do it.
>
> These are real user agents (afaik) and it's not impossible that we'd block real traffic.

It is possible, yes, but I expect traffic from a specific user agent to be negligible. That's why I updated the code to include the whole user agent rather than just a subset of it.

> I'm wondering if we need to take another approach. Where did it come up that these user agents were hitting a customer? I know that this one might be related https://posthoghelp.zendesk.com/agent/tickets/25267, where ahrefs uses lighthouse in one of its crawlers. I wonder if we could just ask ahrefs to identify itself somehow (and not through ip ranges, ideally)

This is from https://posthoghelp.zendesk.com/agent/tickets/25646, and it's possible they've enabled ahrefs or something else, because it started on a specific day (February 3rd) and the number of accesses per day has been pretty consistent since.

I feel confident doing this if we go with the "include the whole user agent" approach, but I'm happy to just close this and let customers know about it if you disagree. They can always use custom_uagent_blocklist to block these themselves - which I already told the customer to do.
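For reference, the customer-side workaround looks roughly like this (the `custom_uagent_blocklist` option name comes from the thread; the rest of the init shape is a sketch, not verified against the posthog-js docs):

```typescript
import posthog from 'posthog-js'

// Sketch: blocking the cryptic Lighthouse UAs via the customer-facing
// custom_uagent_blocklist option mentioned in the thread.
posthog.init('<project-api-key>', {
    custom_uagent_blocklist: ['chrome-lighthouse', 'moto g power (2022)'],
})
```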

@robbie-c (Member) commented Mar 4, 2025

If we wanted, we could also include a flag to block these user agents too, I'm just not sure it should be the default.

IMO it's pretty bad that a crawler hasn't identified itself.

I'll send this to [email protected], any thoughts:

Hi there,

I work for PostHog, and one of our mutual customers has asked for help blocking the ahrefs crawler from their analytics data.

You use lighthouse as part of your crawling, which by default sets the user agent to a realistic user agent (see https://github.com/GoogleChrome/lighthouse/blob/main/core/config/constants.js#L42) which we aren't able to block. We've seen these user agents coming from your IP addresses (e.g. 5.39.1.230).

While we could block events coming from your IP addresses, it'd be very hard for us to do this client-side, where the IP address is not known, which would be an imperfect solution for some of our products.

Could you change the user agent, or at least provide your customers an option to change the user agent when crawling their own site?

Thanks,
Robbie

@rafaeelaudibert (Member, Author) commented:

> If we wanted, we could also include a flag to block these user agents too, I'm just not sure it should be the default.

Yeah, not sure. They already have the power to add the specific user agents they want to block, so I don't see the value in a block_lighthouse flag. Maybe just improved documentation on the matter - listing the offending user agents - would be enough.

> IMO it's pretty bad that a crawler hasn't identified itself.

100%, not very friendly.

> I'll send this to [email protected], any thoughts:
>
> [email draft quoted above]

Can you CC me? I like the email.

The specific support ticket I'm working on isn't coming from Ahrefs though, but it might be something similar; I've let them know.

177.244.11.68

@rafaeelaudibert (Member, Author) commented:
Making this a draft while we validate our solution with ahrefs

@rafaeelaudibert rafaeelaudibert marked this pull request as draft March 4, 2025 16:11
@robbie-c robbie-c removed their request for review March 10, 2025 10:12
@LuizHAP commented Mar 10, 2025

Thanks for helping with this, @rafaeelaudibert and @robbie-c. Meanwhile, while you think about it, I opened a ticket with Arcjet (https://app.arcjet.com/) and they will share their thoughts on it, because they may have dealt with the same issue (I don't know whether it will impact your product, but it would be good to understand how other SaaS companies have handled this).

@davidmytton commented:
Thanks for flagging this to us @LuizHAP.

All the known bot user agents we track are open source at https://github.com/arcjet/well-known-bots. In the case where the user agent is one of those spoofed user agents, Arcjet will do an IP lookup against our reputation database. If it's from a known crawler like Ahrefs then we'll return a deny decision.

@rafaeelaudibert (Member, Author) commented:

> [Arcjet reply quoted above]

Hey, @davidmytton, thank you for sharing that!

We're considering adding an IP lookup as a last resort. Ideally, we'd get the "bad" actors to improve their behavior but we both know that's not that easy. Besides the Lighthouse UAs, do you have something else on the list of known spoofed UAs?

@davidmytton replied:

> [question quoted above]

The UA matching is the first step. If there's no match then we use our private IP reputation database. We also do verification for known providers e.g. we verify whether a request claiming to be Google is actually Google through their rDNS lookup & IP ranges. We do this for several providers who offer those lookup options so clients can't pretend to be good bots!
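The forward-confirmed reverse DNS check described above can be sketched like this, with Googlebot as the example (the hostname suffixes follow Google's published crawler-verification guidance; the function shape is an assumption, not Arcjet's implementation):

```typescript
import { reverse, lookup } from 'node:dns/promises'

// Hostname suffixes Google documents for its crawlers.
const GOOGLE_RDNS_SUFFIXES = ['.googlebot.com', '.google.com', '.googleusercontent.com']

function hostnameMatchesGoogle(hostname: string): boolean {
    return GOOGLE_RDNS_SUFFIXES.some((suffix) => hostname.toLowerCase().endsWith(suffix))
}

// Forward-confirmed rDNS: reverse-resolve the IP, check the PTR hostname's
// suffix, then forward-resolve that hostname and confirm it maps back to the
// original IP, so a spoofed PTR record alone isn't enough.
async function verifyGooglebot(ip: string): Promise<boolean> {
    const hostnames = await reverse(ip)
    const candidate = hostnames.find(hostnameMatchesGoogle)
    if (!candidate) return false
    const { address } = await lookup(candidate)
    return address === ip
}
```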

@posthog-bot (Collaborator) commented:

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week.

@pauldambra (Member) commented:

@benjackwhite the arcjet tool above is interesting... too heavy to include in the SDK but could potentially offer a transformation that uses it to augment during ingestion

(dropping this into your brain, but obvs feel free to ignore 🤣)

@posthog-bot posthog-bot removed the stale label Mar 19, 2025