Commit 68ab787

feat: add multi-source v2 catalog (#18)
* feat: add multi-source v2 catalog
* fix: address multi-source review feedback
* fix: normalize brave debounce data
1 parent 65448fb commit 68ab787

97 files changed

Lines changed: 56050 additions & 158 deletions

.github/workflows/ci.yml

Lines changed: 8 additions & 2 deletions
@@ -66,8 +66,14 @@ jobs:
       - name: Verify generated catalog matches checked-in catalog
         run: |
           pnpm build:catalog
-          git diff --exit-code catalog/clearurls.json \
-            || (echo "catalog/clearurls.json is stale — run \`pnpm build:catalog\` and commit" && exit 1)
+          git diff --exit-code \
+            catalog/adguard.json \
+            catalog/brave.json \
+            catalog/catalog.json \
+            catalog/clearurls.json \
+            catalog/firefox.json \
+            crates/url-sanitize/catalog/catalog.json \
+            || (echo "generated catalog files are stale — run \`pnpm build:catalog\` and commit" && exit 1)
       - name: Verify generated conformance corpus matches checked-in files
         run: |
           pnpm build:conformance

.github/workflows/release.yml

Lines changed: 4 additions & 0 deletions
@@ -361,6 +361,10 @@ jobs:

           publish_pkg "@url-sanitize/core" "@url-sanitize/core"
           publish_pkg "@url-sanitize/clearurls" "@url-sanitize/clearurls"
+          publish_pkg "@url-sanitize/adguard" "@url-sanitize/adguard"
+          publish_pkg "@url-sanitize/brave" "@url-sanitize/brave"
+          publish_pkg "@url-sanitize/firefox" "@url-sanitize/firefox"
+          publish_pkg "@url-sanitize/merged" "@url-sanitize/merged"
           publish_pkg "@url-sanitize/cli" "@url-sanitize/cli"
           publish_pkg "@url-sanitize/fetch" "@url-sanitize/fetch"

.github/workflows/sync-clearurls.yml

Lines changed: 8 additions & 8 deletions
@@ -1,4 +1,4 @@
-name: sync-clearurls
+name: sync-sources

 on:
   schedule:
@@ -25,12 +25,12 @@ jobs:
       - run: pnpm install --frozen-lockfile

       - name: Fetch and verify upstream rules
-        run: pnpm sync:clearurls
+        run: pnpm sync:sources

       - name: Detect changes
         id: diff
         run: |
-          if git diff --quiet packages/clearurls/data; then
+          if git diff --quiet packages/clearurls/data packages/adguard/data packages/brave/data packages/firefox/data; then
             echo "changed=false" >> "$GITHUB_OUTPUT"
           else
             echo "changed=true" >> "$GITHUB_OUTPUT"
@@ -52,12 +52,12 @@ jobs:
         uses: peter-evans/create-pull-request@5f6978faf089d4d20b00c7766989d076bb2fc7f1 # v8.1.1
         with:
           commit-message: 'release: v${{ steps.bump.outputs.new_version }}'
-          title: 'chore(clearurls): sync upstream rules → v${{ steps.bump.outputs.new_version }}'
+          title: 'chore(sources): sync upstream rules → v${{ steps.bump.outputs.new_version }}'
           body: |
-            Automated daily sync of ClearURLs rule catalog from
-            https://rules2.clearurls.xyz/data.minify.json.
-            SHA256 verified against rules.minify.hash.
-          branch: clearurls-sync
+            Automated daily sync of url-sanitize upstream rule catalogs:
+            ClearURLs, AdGuard URL Tracking Protection, Brave Debouncer, and
+            Firefox Query Stripping.
+          branch: source-sync
           delete-branch: true

       - name: Enable auto-merge
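
Review note: the removed PR body said the ClearURLs download was SHA256-verified against `rules.minify.hash`, and the step is still called "Fetch and verify upstream rules". This diff does not show how `pnpm sync:sources` performs that check for the four sources, so the following is only a minimal sketch of hash verification using Node's built-in `node:crypto`. The names `sha256Hex` and `verifyPayload` are hypothetical and are not taken from this repository.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical helper: hex-encoded SHA-256 digest of a downloaded payload.
function sha256Hex(payload: string | Uint8Array): string {
  return createHash('sha256').update(payload).digest('hex');
}

// Hypothetical helper: compare the payload's digest with a published hash.
// The published value is trimmed and lowercased so that trailing newlines
// or uppercase hex in the hash file do not cause false mismatches.
function verifyPayload(payload: string | Uint8Array, expectedHex: string): boolean {
  return sha256Hex(payload) === expectedHex.trim().toLowerCase();
}
```

A sync script along these lines would presumably refuse to write a source's data directory when `verifyPayload` returns `false`, so that the "Detect changes" step never sees an unverified payload.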

.gitignore

Lines changed: 19 additions & 0 deletions
@@ -1,5 +1,6 @@
 node_modules/
 dist/
+npm-packages/
 *.tsbuildinfo
 .turbo/
 coverage/
@@ -21,3 +22,21 @@ __pycache__/
 target/
 Cargo.lock.bak
 **/*.rs.bk
+
+# Source-adjacent TypeScript emits from accidental non-noEmit tsc runs
+packages/*/src/**/*.js
+packages/*/src/**/*.js.map
+packages/*/src/**/*.d.ts
+packages/*/src/**/*.d.ts.map
+packages/*/test/**/*.js
+packages/*/test/**/*.js.map
+packages/*/test/**/*.d.ts
+packages/*/test/**/*.d.ts.map
+sources/**/*.js
+sources/**/*.js.map
+sources/**/*.d.ts
+sources/**/*.d.ts.map
+vitest.config.js
+vitest.config.js.map
+vitest.config.d.ts
+vitest.config.d.ts.map

Cargo.lock

Lines changed: 9 additions & 2 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 3 additions & 2 deletions
@@ -3,7 +3,7 @@ resolver = "2"
 members = ["crates/url-sanitize-core", "crates/url-sanitize"]

 [workspace.package]
-version = "1.0.0"
+version = "2.0.0"
 edition = "2021"
 rust-version = "1.75"
 license = "MIT"
@@ -12,11 +12,12 @@ repository = "https://github.com/antonio-orionus/url-sanitize"
 homepage = "https://github.com/antonio-orionus/url-sanitize"

 [workspace.dependencies]
-url-sanitize-core = { path = "crates/url-sanitize-core", version = "1.0.0" }
+url-sanitize-core = { path = "crates/url-sanitize-core", version = "2.0.0" }
 regex-lite = "0.1"
 url = "2.5"
 serde = { version = "1", features = ["derive"] }
 serde_json = "1"
+base64 = "0.22"

 [profile.release]
 opt-level = "z"

README.md

Lines changed: 29 additions & 18 deletions
@@ -1,18 +1,18 @@
 # url-sanitize

-> Remove tracking parameters and unwrap tracking redirects from URLs with ClearURLs-compatible rules.
+> Remove tracking parameters and unwrap tracking redirects from URLs with ClearURLs, AdGuard, Brave, and Firefox rules.

-**Looking for CleanURLs / ClearURLs behavior as a library or CLI?** You're in the right place. `url-sanitize` removes tracking junk like `utm_*`, `fbclid`, and redirector wrappers while telling you exactly which rule changed the URL.
+**Looking for CleanURLs / ClearURLs behavior as a library or CLI?** You're in the right place. `url-sanitize` removes tracking junk like `utm_*`, `fbclid`, and redirector wrappers, now using a merged ClearURLs / AdGuard / Brave / Firefox catalog by default.

 Use it from npm, crates.io, native release binaries, Python, CI, workers, browsers, edge runtimes, Node.js, Bun, and Deno.

 ## Why this exists

 - **One behavior contract across languages.** TypeScript and Rust implementations are checked against the same JSONL conformance corpus.
 - **Explainable privacy cleanup.** Results include the stripped params, redirect provider, or block rule instead of returning an opaque string.
-- **ClearURLs-compatible without AGPL lock-in.** Code, CLIs, and tooling are MIT; ClearURLs-derived rule data remains LGPL-3.0-only.
+- **Multi-source without AGPL lock-in.** Code, CLIs, and tooling are MIT; upstream rule data keeps its source license.
 - **Automation-friendly.** The Rust CLI is deterministic, prompt-free, supports `--json`, and embeds a pinned catalog.
-- **Fresh rules.** GitHub Actions syncs the upstream ClearURLs catalog daily and release workflows publish npm packages, crates, Python wheels, and native binaries.
+- **Fresh rules.** GitHub Actions syncs upstream ClearURLs, AdGuard, Brave, and Firefox catalogs; release workflows publish npm packages, crates, Python wheels, and native binaries.

 ## Install

@@ -39,7 +39,8 @@ irm https://github.com/antonio-orionus/url-sanitize/releases/latest/download/url

 ```sh
 npm install -g @url-sanitize/cli
-npm install @url-sanitize/core @url-sanitize/clearurls
+npm install @url-sanitize/core @url-sanitize/merged
+npm install @url-sanitize/clearurls @url-sanitize/adguard @url-sanitize/brave @url-sanitize/firefox
 npm install @url-sanitize/fetch
 cargo install url-sanitize
 cargo add url-sanitize-core
@@ -83,7 +84,7 @@ renders published metadata from GitHub Release `SHA256SUMS`.
 For CI, prefer a pinned release instead of `latest`:

 ```sh
-version="v1.0.0"
+version="v2.0.0"
 target="x86_64-unknown-linux-gnu"
 asset="url-sanitize-${target}.tar.gz"

@@ -104,7 +105,7 @@ jobs:
       - name: Install url-sanitize
         run: |
           set -euo pipefail
-          version="v1.0.0"
+          version="v2.0.0"
           target="x86_64-unknown-linux-gnu"
           asset="url-sanitize-${target}.tar.gz"

@@ -132,7 +133,7 @@ url-sanitize:
   script:
     - |
       set -eu
-      version="v1.0.0"
+      version="v2.0.0"
       target="x86_64-unknown-linux-gnu"
       asset="url-sanitize-${target}.tar.gz"

@@ -151,7 +152,7 @@ Dockerfile:
 ```Dockerfile
 FROM ubuntu:24.04

-ARG URL_SANITIZE_VERSION=v1.0.0
+ARG URL_SANITIZE_VERSION=v2.0.0
 ARG URL_SANITIZE_TARGET=x86_64-unknown-linux-gnu

 RUN apt-get update \
@@ -172,7 +173,7 @@ RUN set -eux; \
 ## TypeScript Quick Start

 ```ts
-import { sanitize } from '@url-sanitize/clearurls';
+import { sanitize } from '@url-sanitize/merged';

 const result = sanitize('https://example.com/article?utm_source=newsletter&id=123');

@@ -190,9 +191,15 @@ console.log(result);

 ```ts
 import { compileSanitizer } from '@url-sanitize/core';
-import { clearurlsCatalog } from '@url-sanitize/clearurls';
+import { mergedCatalog } from '@url-sanitize/merged';
+
+const sanitize = compileSanitizer(mergedCatalog, { stripReferralMarketing: true });
+```

-const sanitize = compileSanitizer(clearurlsCatalog, { stripReferralMarketing: true });
+**ClearURLs-only behavior is still available:**
+
+```ts
+import { sanitize } from '@url-sanitize/clearurls';
 ```

 ## CLI Quick Start
@@ -210,7 +217,7 @@ url-sanitize --json "https://www.google.com/url?q=https%3A%2F%2Fexample.org"
 ```rust
 use url_sanitize_core::{Catalog, SanitizerOptions};

-let json = std::fs::read_to_string("catalog/clearurls.json")?;
+let json = std::fs::read_to_string("catalog/catalog.json")?;
 let catalog = Catalog::from_json(&json)?;
 let sanitizer = catalog.compile(SanitizerOptions::default());
 let result = sanitizer.sanitize("https://example.com/?utm_source=x");
@@ -224,17 +231,21 @@ println!("{}", serde_json::to_string(&result)?);
 | --- | --- | --- |
 | [`@url-sanitize/core`](packages/core) | Pure TypeScript sanitization engine. Zero runtime deps. | MIT |
 | [`@url-sanitize/clearurls`](packages/clearurls) | ClearURLs-compatible catalog + adapter. | MIT (code) + LGPL-3.0-only (data) |
+| [`@url-sanitize/adguard`](packages/adguard) | AdGuard URL Tracking Protection catalog + adapter. | LGPL-3.0-only |
+| [`@url-sanitize/brave`](packages/brave) | Brave Debouncer catalog + adapter. | MPL-2.0 |
+| [`@url-sanitize/firefox`](packages/firefox) | Firefox Query Stripping catalog + adapter. | MPL-2.0 |
+| [`@url-sanitize/merged`](packages/merged) | Default merged multi-source catalog. | MIT metadata + upstream data licenses |
 | [`@url-sanitize/cli`](packages/cli) | npm CLI for removing tracking parameters and redirect wrappers. | MIT |
 | [`@url-sanitize/fetch`](packages/fetch) | Runtime ClearURLs catalog fetch + SHA256 / pinned-hash verification. | MIT |
 | [`url-sanitize-core`](crates/url-sanitize-core) | Pure-Rust implementation. | MIT |
-| [`url-sanitize`](crates/url-sanitize) | Native Rust CLI with embedded ClearURLs catalog. | MIT |
+| [`url-sanitize`](crates/url-sanitize) | Native Rust CLI with embedded merged catalog. | MIT |
 | [`url-sanitize`](python) | Python wrapper around the native CLI. | MIT |
 | `@url-sanitize/action` | Deferred GitHub Action for downstream PR / docs hygiene. | MIT |

 ## GitHub Automation

 - `ci.yml` verifies TypeScript build, typecheck, lint, tests, generated catalog freshness, generated conformance freshness, Rust fmt/clippy/tests/package checks, release binary size, npm/Python package smoke tests, installer smoke, and Homebrew/Scoop fixture smoke where runner support exists.
-- `sync-clearurls.yml` checks upstream ClearURLs daily and opens a version-bump PR when rules change.
+- `sync-sources` checks upstream rule sources daily and opens a version-bump PR when rules change.
 - `release-dry-run.yml` builds the release matrix on PRs, assembles archives, renders Homebrew/Scoop metadata, and validates installer/package-manager syntax before merge.
 - `auto-tag.yml` verifies release metadata, creates annotated release tags after package version bumps land on `main`, and explicitly dispatches `release.yml`.
 - `release.yml` publishes npm packages, Rust crates, PyPI package, native GitHub Release assets, Homebrew/Scoop metadata, installer smoke tests, package-manager install smoke, and public endpoint smoke from `v*` tags.
@@ -272,14 +283,14 @@ skips external package-manager publication.
 - **v0.1** — TypeScript engine, ClearURLs adapter, npm CLI, Rust engine, Rust CLI, shared conformance, daily sync workflow
 - **v0.2** — broader native archive coverage, installer refinements, Homebrew/Scoop, CI install examples
 - **v0.3** — runtime catalog fetching, custom user-defined catalogs, schema validation
-- **Deferred** — GitHub Action and MCP surfaces until downstream demand is concrete
 - **v1.0** — stable public API + result types + benchmarks + security policy
-- **v2.0** — multi-source: AdGuard URL Tracking, Brave Debouncer, Firefox query-strip
+- **v2.0** — multi-source packages for AdGuard URL Tracking, Brave Debouncer, Firefox query-strip, and a merged catalog
+- **Deferred** — GitHub Action, MCP, extra package managers, native npm packages, WASM, and in-process Python bindings

 ## Contributing

 PRs welcome. See [CONTRIBUTING.md](CONTRIBUTING.md).

 ## License

-MIT for engine + CLI + tooling. LGPL-3.0-only for ClearURLs-derived data in `@url-sanitize/clearurls`. See [LICENSE](LICENSE) and [docs/license-model.md](docs/license-model.md).
+MIT for engine + CLI + tooling. Bundled upstream rule data keeps its source license: ClearURLs and AdGuard are LGPL-3.0-only; Brave and Firefox data are MPL-2.0. See [LICENSE](LICENSE) and [docs/license-model.md](docs/license-model.md).
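
Review note: the README now describes a merged multi-source catalog whose results still report what was stripped. The real catalog schema and merge logic are not visible in this part of the diff, so the sketch below only illustrates the general idea with the standard WHATWG `URL` API. The `SourceRules` shape and the `mergeParams` and `stripParams` functions are hypothetical and do not reflect the `@url-sanitize/merged` API.

```typescript
// Hypothetical per-source rule list; the real catalog schema is richer than this.
type SourceRules = { source: string; params: string[] };

// Merge several sources into one lookup: parameter name -> first source listing it.
function mergeParams(sources: SourceRules[]): Map<string, string> {
  const merged = new Map<string, string>();
  for (const { source, params } of sources) {
    for (const param of params) {
      if (!merged.has(param)) merged.set(param, source);
    }
  }
  return merged;
}

// Remove every query parameter named in the merged lookup and report
// which source was responsible for each removal.
function stripParams(input: string, merged: Map<string, string>) {
  const url = new URL(input);
  const removed: { param: string; source: string }[] = [];
  // Deduplicate keys first so a repeated parameter is reported only once.
  for (const key of Array.from(new Set(url.searchParams.keys()))) {
    const source = merged.get(key);
    if (source !== undefined) {
      removed.push({ param: key, source });
      url.searchParams.delete(key); // deletes all occurrences of the key
    }
  }
  return { url: url.toString(), removed };
}

// Usage with made-up rule lists (not real catalog data):
const demoCatalog = mergeParams([
  { source: 'clearurls', params: ['utm_source', 'fbclid'] },
  { source: 'firefox', params: ['fbclid', 'mc_eid'] },
]);
const demo = stripParams('https://example.com/article?utm_source=newsletter&id=123', demoCatalog);
console.log(demo.url); // https://example.com/article?id=123
```

Keeping the first source that lists a parameter is one possible precedence choice; the actual merged catalog may resolve overlaps differently.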
