
Commit 122e4bc

feat: first implementation of honeypot logic (#1342)
* feat: first implementation of honeypot logic

This is a bit of an experiment, stick with me. The core idea here is that badly written crawlers are just that: badly written. They look for anything that contains `<a href="whatever" />` tags and will blindly use those values to recurse. This takes advantage of that by hiding a link in a `<script>` tag like this:

```html
<script type="ignore"><a href="/bots-only">Don't click</a></script>
```

Browsers will ignore it because they have no handler for the "ignore" script type.

This current draft is very unoptimized (it takes like 7 seconds to generate a page on my tower); however, switching spintax libraries will make this much faster. The hope is to make this pluggable with WebAssembly such that we force administrators to choose a storage method. First we crawl before we walk.

The AI involvement in this commit is limited to the spintax in affirmations.txt, spintext.txt, and titles.txt. This generates a bunch of "pseudoprofound bullshit" like the following:

> This Restoration to Balance & Alignment
>
> There's a moment when creators are being called to realize that the work can't be reduced to results, but about energy. We don't innovate products by pushing harder, we do it by holding the vision. Because momentum can't be forced, it unfolds over time when culture are moving in the same direction. We're being invited into a paradigm shift in how we think about innovation. [...]

This is intended to "look" like normal article text. As this is a first draft, this sucks and will be improved upon.

Assisted-by: GLM 4.6, ChatGPT, GPT-OSS 120b
Signed-off-by: Xe Iaso <me@xeiaso.net>

* fix(honeypot/naive): optimize hilariously

Signed-off-by: Xe Iaso <me@xeiaso.net>

* feat(honeypot/naive): attempt to automatically filter out based on crawling

Signed-off-by: Xe Iaso <me@xeiaso.net>

* fix(lib): use mazeGen instead of bsGen

Signed-off-by: Xe Iaso <me@xeiaso.net>

* docs: add honeypot docs

Signed-off-by: Xe Iaso <me@xeiaso.net>

* chore(test): go mod tidy

Signed-off-by: Xe Iaso <me@xeiaso.net>

* chore: fix spelling metadata

Signed-off-by: Xe Iaso <me@xeiaso.net>

* chore: spelling

Signed-off-by: Xe Iaso <me@xeiaso.net>

---------

Signed-off-by: Xe Iaso <me@xeiaso.net>
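To illustrate why the `<script type="ignore">` trick works: an HTML-aware browser never renders the contents of a script element with an unknown type, but a regex-driven crawler greps the raw bytes and finds the link anyway. A minimal sketch in Go (the page snippet and regex are illustrative, not Anubis code):

```go
package main

import (
	"fmt"
	"regexp"
)

// naiveLinks mimics a badly written crawler: it greps the raw HTML
// for href attributes, with no awareness that the match sits inside
// a <script> element a real browser would never render or execute.
func naiveLinks(page string) []string {
	re := regexp.MustCompile(`<a href="([^"]+)"`)
	var out []string
	for _, m := range re.FindAllStringSubmatch(page, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	page := `<p>Real content.</p>
<script type="ignore"><a href="/bots-only">Don't click</a></script>`
	// The crawler recurses into /bots-only; a browser shows only the paragraph.
	fmt.Println(naiveLinks(page))
}
```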
1 parent cb91145 commit 122e4bc

25 files changed

Lines changed: 968 additions & 84 deletions

.github/actions/spelling/allow.txt

Lines changed: 6 additions & 0 deletions
```diff
@@ -12,3 +12,9 @@ maintnotifications
 azurediamond
 cooldown
 verifyfcrdns
+Spintax
+spintax
+clampip
+pseudoprofound
+reimagining
+iocaine
```

.github/actions/spelling/expect.txt

Lines changed: 10 additions & 1 deletion
```diff
@@ -1,4 +1,3 @@
-
 acs
 Actorified
 actorifiedstore
@@ -398,3 +397,13 @@ Zenos
 zizmor
 zombocom
 zos
+GLM
+iocaine
+nikandfor
+pagegen
+pseudoprofound
+reimagining
+Rhul
+shoneypot
+spammer
+Y'shtola
```

docs/docs/CHANGELOG.md

Lines changed: 6 additions & 0 deletions
```diff
@@ -28,6 +28,12 @@ Anubis is back and better than ever! Lots of minor fixes with some big ones inte
 - Open Graph passthrough now reuses the configured target Host/SNI/TLS settings, so metadata fetches succeed when the upstream certificate differs from the public domain. ([1283](https://github.com/TecharoHQ/anubis/pull/1283))
 - Stabilize the CVE-2025-24369 regression test by always submitting an invalid proof instead of relying on random POW failures.
 
+### Dataset poisoning
+
+Anubis has the ability to engage in [dataset poisoning attacks](https://www.anthropic.com/research/small-samples-poison) using the [dataset poisoning subsystem](./admin/honeypot/overview.mdx). This allows every Anubis instance to be a honeypot to attract and flag abusive scrapers so that no administrator action is required to ban them.
+
+There is much more information about this feature in [the dataset poisoning subsystem documentation](./admin/honeypot/overview.mdx). Administrators that are interested in learning how this feature works should consult that documentation.
+
 ### Deprecate `report_as` in challenge configuration
 
 Previously Anubis let you lie to users about the difficulty of a challenge to interfere with operators of malicious scrapers as a psychological attack:
```
Lines changed: 8 additions & 0 deletions
```diff
@@ -0,0 +1,8 @@
+{
+  "label": "Honeypot",
+  "position": 40,
+  "link": {
+    "type": "generated-index",
+    "description": "Honeypot features in Anubis, allowing Anubis to passively detect malicious crawlers."
+  }
+}
```
Lines changed: 40 additions & 0 deletions
```diff
@@ -0,0 +1,40 @@
+---
+title: Dataset poisoning
+---
+
+Anubis offers the ability to participate in [dataset poisoning](https://www.anthropic.com/research/small-samples-poison) attacks similar to what [iocaine](https://iocaine.madhouse-project.org/) and other similar tools offer. Currently this is in a preview state where a lot of details are hard-coded in order to test the viability of this approach.
+
+In essence, when Anubis challenge and error pages are rendered they include a small bit of HTML code that browsers will ignore but scrapers will interpret as a link to ingest. This then creates a small forest of recursive nothing pages that are designed according to the following principles:
+
+- These pages are _cheap_ to render, rendering in at most ten milliseconds on decently specced hardware.
+- These pages are _vacuous_, meaning that they are essentially devoid of content such that a human would find them odd and click away, but a scraper would not be able to know that and would continue through the forest.
+- These pages are _fairly large_ so that scrapers don't think that the pages are error pages or are otherwise devoid of content.
+- These pages are _fully self-contained_ so that they load fast without incurring additional load from resource fetches.
+
+In this limited preview state, Anubis generates pages using [spintax](https://outboundly.ai/blogs/what-is-spintax-and-how-to-use-it/). Spintax is a syntax used to create different variants of utterances for marketing messages and email spam that evade word filtering. In its current form, Anubis' dataset poisoning has AI-generated spintax that produces vapid LinkedIn posts with some western occultism thrown in for good measure. This results in utterances like the following:
+
+> There's a moment when visionaries are being called to realize that the work can't be reduced to optimization, but about resonance. We don't transform products by grinding endlessly, we do it by holding the vision. Because meaning can't be forced, it unfolds over time when culture are in integrity. This moment represents a fundamental reimagining in how we think about work. This isn't a framework, it's a lived truth that requires courage. When we get honest, we activate nonlinear growth that don't show up in dashboards, but redefine success anyway.
+
+It should be fairly transparent to humans that this is pseudoprofound anti-content and a signal to click away.
+
+## Plans
+
+Future versions of this feature will allow for more customization. In the near future this will be configurable via the following mechanisms:
+
+- WebAssembly logic for customizing how the poisoning data is generated (with examples including the existing spintax method).
+- Weight thresholds and logic for how they are interpreted by Anubis.
+- Other configuration settings as facts and circumstances dictate.
+
+## Implementation notes
+
+In its current implementation, the Anubis dataset poisoning feature has the following flaws that may hinder production deployments:
+
+- All Anubis instances use the same method for generating dataset poisoning information. This may be easy for malicious actors to detect and ignore.
+- Anubis dataset poisoning routes are under the `/.within.website/x/cmd/anubis` URL hierarchy. This may be easy for malicious actors to detect and ignore.
+
+Right now Anubis assigns 30 weight points if the following criteria are met:
+
+- A client's User-Agent has been observed in the dataset poisoning maze at least 25 times.
+- The network-clamped IP address (/24 for IPv4 and /48 for IPv6) has been observed in the dataset poisoning maze at least 25 times.
+
+Additionally, when any given client by both User-Agent and network-clamped IP address has been observed, Anubis will emit log lines warning about it so that administrative action can be taken up to and including [filing abuse reports with the network owner](/blog/2025/file-abuse-reports).
```

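The flagging rule described in the overview (30 weight points once the User-Agent and the network-clamped IP have each been observed at least 25 times) could be sketched as follows. `observation` and `weightFor` are hypothetical names, not Anubis' actual internals, and the sketch assumes both criteria must be met:

```go
package main

import "fmt"

// observation tracks how many times a client has been seen inside
// the poisoning maze, keyed (hypothetically) by User-Agent and by
// network-clamped IP prefix.
type observation struct {
	uaHits  int // times this User-Agent hit the maze
	netHits int // times this /24 (IPv4) or /48 (IPv6) hit the maze
}

// weightFor applies the documented rule: 30 weight points once both
// counters reach 25 observations, otherwise no extra weight.
func weightFor(o observation) int {
	const threshold = 25
	if o.uaHits >= threshold && o.netHits >= threshold {
		return 30
	}
	return 0
}

func main() {
	fmt.Println(weightFor(observation{uaHits: 40, netHits: 31})) // flagged
	fmt.Println(weightFor(observation{uaHits: 3, netHits: 31}))  // not yet
}
```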
go.mod

Lines changed: 1 addition & 0 deletions
```diff
@@ -20,6 +20,7 @@ require (
 	github.com/joho/godotenv v1.5.1
 	github.com/lum8rjack/go-ja4h v0.0.0-20250828030157-fa5266d50650
 	github.com/nicksnyder/go-i18n/v2 v2.6.0
+	github.com/nikandfor/spintax v0.0.0-20181023094358-fc346b245bb3
 	github.com/playwright-community/playwright-go v0.5200.1
 	github.com/prometheus/client_golang v1.23.2
 	github.com/redis/go-redis/v9 v9.17.2
```

go.sum

Lines changed: 2 additions & 0 deletions
```diff
@@ -320,6 +320,8 @@ github.com/natefinch/atomic v1.0.1 h1:ZPYKxkqQOx3KZ+RsbnP/YsgvxWQPGxjC0oBt2AhwV0
 github.com/natefinch/atomic v1.0.1/go.mod h1:N/D/ELrljoqDyT3rZrsUmtsuzvHkeB/wWjHV22AZRbM=
 github.com/nicksnyder/go-i18n/v2 v2.6.0 h1:C/m2NNWNiTB6SK4Ao8df5EWm3JETSTIGNXBpMJTxzxQ=
 github.com/nicksnyder/go-i18n/v2 v2.6.0/go.mod h1:88sRqr0C6OPyJn0/KRNaEz1uWorjxIKP7rUUcvycecE=
+github.com/nikandfor/spintax v0.0.0-20181023094358-fc346b245bb3 h1:foZ9X1bz2KmW7b8Yx5V0LAQKhTazdllv5rnGUe6iGTY=
+github.com/nikandfor/spintax v0.0.0-20181023094358-fc346b245bb3/go.mod h1:wwDYKfVF3WHdY0rugsAZoIpyQjDA3bn9wEzo/QXPx1Y=
 github.com/onsi/gomega v1.35.1 h1:Cwbd75ZBPxFSuZ6T+rN/WCb/gOc6YgFBXLlZLhC7Ds4=
 github.com/onsi/gomega v1.35.1/go.mod h1:PvZbdDc8J6XJEpDK4HCuRBm8a6Fzp9/DmhC9C7yFlog=
 github.com/opencontainers/go-digest v1.0.0 h1:apOUWs51W5PlhuyGyz9FCeeBIOUDA/6nW8Oi/yOhh5U=
```

internal/clampip.go

Lines changed: 33 additions & 0 deletions
```diff
@@ -0,0 +1,33 @@
+package internal
+
+import "net/netip"
+
+func ClampIP(addr netip.Addr) (netip.Prefix, bool) {
+	switch {
+	case addr.Is4():
+		result, err := addr.Prefix(24)
+		if err != nil {
+			return netip.Prefix{}, false
+		}
+		return result, true
+
+	case addr.Is4In6():
+		// Extract the IPv4 address from IPv4-mapped IPv6 and clamp it
+		ipv4 := addr.Unmap()
+		result, err := ipv4.Prefix(24)
+		if err != nil {
+			return netip.Prefix{}, false
+		}
+		return result, true
+
+	case addr.Is6():
+		result, err := addr.Prefix(48)
+		if err != nil {
+			return netip.Prefix{}, false
+		}
+		return result, true
+
+	default:
+		return netip.Prefix{}, false
+	}
+}
```
