All notable changes to crawlberg are documented here.
Maintenance release. Version-only bump synced across all manifests; the `.gitignore` ai-rulez block was reorganized.
First stable release. Promotes 1.0.0-rc.2; version-only bump synced across all manifests.
Release candidate 2. Maintenance release with version bump.
- Renamed the project from `kreuzcrawl` to `crawlberg`. The crate (`crawlberg`), every per-language package, the C FFI symbol prefix (`kcrawl_*` → `cberg_*`), the Go module (`github.com/xberg-io/crawlberg`), and the docs domain (`docs.crawlberg.xberg.io`) follow.
- Rebranded the `kreuzberg` namespace to `xberg`. npm scope `@kreuzberg` → `@xberg-io`, JVM/Maven groupId `dev.kreuzberg` → `io.xberg`, ecosystem links and badges move to `github.com/xberg-io/xberg` and the `Xberg.dev` brand, and `KREUZBERG_*` env vars become `CRAWLBERG_*`. The legal entity name (Kreuzberg, Inc.) is unchanged.
- Swift publish now creates the `release/swift/<version>` branch carrying the substituted XCFramework checksum. The alef-generated Swift e2e/test-app pins `.package(url: …, branch: "release/swift/<version>")`, but the publish workflow only force-moved the `v<version>` tag and never created that branch, so SwiftPM could not resolve the package. The checksummed commit is now also pushed to `refs/heads/release/swift/<version>`. (`.github/workflows/publish.yaml`)
First stable release. crawlberg ships a Rust core with active bindings for Python, TypeScript/Node, Ruby, PHP, Go, Java/JNI, C#, Elixir, WebAssembly, Dart, Kotlin/Android, Swift, Zig, and C FFI, plus a CLI, an HTTP API, and an MCP server.
- Tiered dispatch engine. The crawl engine chains HTTP → Bypass → Browser tiers driven by per-attempt signals rather than a single bypass short-circuit. Public `crawlberg::types::dispatch` surface: `Tier`, `EscalationStrategy`, `EscalationReason`, `AttemptOutcome`, `RetryDirective`, `RetryPolicy`, `WafSignal`, `WafClassifier`, `DomainStatePort`, `DomainRecommendation`, `EscalationBudget`, and `DispatchProfile` (dispatch enums are `#[non_exhaustive]`). `CrawlConfig::builder()` and `DispatchProfile::builder()` provide fluent construction.
- WAF detection. A TOML fingerprint corpus (`rules/waf_fingerprints.toml`, 34 fingerprints) with an Aho-Corasick matcher, `TomlClassifier::watch()` hot-reload (debounced, atomic `ArcSwap`, Kubernetes ConfigMap-safe), and `EwmaDomainState` for per-domain block-rate tracking that promotes/demotes the starting tier.
- SSRF defense. New `crawlberg::net::ssrf` module: `SsrfPolicy`, `HostMatcher` (`Exact`/`Suffix`/`Cidr`), `SsrfError`, and async `validate_url`. `CrawlConfig::ssrf` plus builder methods `allow_private_networks(bool)` and `ssrf_allowlist_host(HostMatcher)`; `CrawlError::SsrfPolicyViolation`. Exposed as a settable DTO (`deny_private`, `max_redirects`) across every binding.
- Browser pool injection. `BrowserPool`/`BrowserPoolConfig` and `NativeBrowserExecutor`/`NativeBrowserExecutorConfig` are public; `CrawlEngineBuilder::with_browser_pool`/`with_native_executor` and `CrawlEngineHandle::from_engine` let consumers construct and `warm()` a pool once and reuse it across all crawl jobs.
- Public substrate parsers. `crawlberg::robots` and `crawlberg::sitemap` are public (`parse_robots_txt`, `is_path_allowed`, `RobotsRules`, `parse_sitemap_xml`, `parse_sitemap_index`, `is_sitemap_index`), usable without spinning up the engine.
- Pluggable proxy rotation. `ProxyProvider` trait + `StaticProxyProvider` baseline, wired into the reqwest fetch path via `CrawlEngineBuilder::with_proxy_provider`; called per request and taking precedence over the static `CrawlConfig::proxy` value.
- CLI. `batch-scrape`, `batch-crawl`, `download`, `citations`, and `version` subcommands, bringing the CLI to 1:1 with the core and MCP surfaces.
- MCP server. Tools are 1:1 with the CLI (`batch_crawl`, `generate_citations`, …), each declaring `read_only`/`destructive`/`open_world` safety annotations, and are served over both stdio and rmcp Streamable HTTP at `/mcp` when the binary is built with the `api` + `mcp` features.
- Observability. OpenTelemetry counters `crawlberg_waf_fingerprint_matches_total` and `crawlberg_escalations_total`, plus property tests, cargo-fuzz targets, and Criterion benchmarks covering the WAF subsystem.
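
The per-domain block-rate tracking in the WAF detection entry can be pictured with a small, self-contained sketch. This is not the crate's `EwmaDomainState`. The type names `BlockRateTracker` and `StartTier`, the smoothing factor, and the 0.4/0.8 thresholds are all assumptions chosen for illustration; only the idea (an exponentially weighted moving average of block observations that decides the starting tier) comes from the entry above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the starting tier (HTTP -> Bypass -> Browser).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StartTier {
    Http,
    Bypass,
    Browser,
}

/// Hypothetical tracker: one EWMA block rate per domain.
struct BlockRateTracker {
    alpha: f64, // weight given to the newest observation (assumed value)
    rates: HashMap<String, f64>,
}

impl BlockRateTracker {
    fn new(alpha: f64) -> Self {
        Self { alpha, rates: HashMap::new() }
    }

    /// Fold one attempt outcome into the domain's moving average.
    /// Unseen domains start from a block rate of 0.0.
    fn observe(&mut self, domain: &str, blocked: bool) {
        let sample = if blocked { 1.0 } else { 0.0 };
        let alpha = self.alpha;
        let rate = self.rates.entry(domain.to_string()).or_insert(0.0);
        *rate = alpha * sample + (1.0 - alpha) * *rate;
    }

    /// Pick a starting tier from the current block rate.
    /// The thresholds are illustrative, not taken from the crate.
    fn recommend(&self, domain: &str) -> StartTier {
        let rate = self.rates.get(domain).copied().unwrap_or(0.0);
        if rate >= 0.8 {
            StartTier::Browser
        } else if rate >= 0.4 {
            StartTier::Bypass
        } else {
            StartTier::Http
        }
    }
}
```

Because recent samples carry the most weight, a domain that stops blocking decays back toward the cheapest tier after a few successful attempts instead of staying promoted forever.
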
- Memory-bounded streaming crawl. `crawl_stream`/`batch_crawl_stream` move each page into its `CrawlEvent::Page` and drop it instead of accumulating every page, bounding peak memory on large crawls (≈2.5 GB → ≈20 MB working set). `crawl()`'s batch result is unchanged.
- Dispatch model. `CrawlError::WafBlocked` is now a struct variant (`{ vendor, message }`); `DomainStatePort` moved to an observation model (`recommend`/`observe`); `SimpleRetryPolicy`'s off-by-one is fixed (`max_retries=3` yields 3 retries); `#[non_exhaustive]` added to `CrawlError`, `NetworkErrorKind`, and the dispatch enums so future variants are non-breaking.
- Asset downloads route through `http_fetch`, so every file fetch is subject to the SSRF policy.
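
The retry-count semantics in the dispatch model entry (`max_retries=3` yields 3 retries) can be sketched as follows. This is not `SimpleRetryPolicy` itself. `run_with_retries` is a hypothetical helper, and it assumes a "retry" means an attempt made after the initial one, so a budget of 3 allows up to 4 attempts in total.

```rust
/// Hypothetical helper: call `op` until it succeeds or the retry budget is
/// spent. `op` receives the number of retries used so far. Returns the final
/// result together with the number of retries that were actually performed.
fn run_with_retries<T, E>(
    max_retries: u32,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> (Result<T, E>, u32) {
    let mut retries = 0;
    loop {
        match op(retries) {
            Ok(value) => return (Ok(value), retries),
            // Budget exhausted: surface the last error without retrying again.
            Err(err) if retries >= max_retries => return (Err(err), retries),
            // Otherwise consume one retry and go around again.
            Err(_) => retries += 1,
        }
    }
}
```

Comparing with `>=` before incrementing is what keeps the count exact; checking after the increment, or with `>`, is the usual way such a loop ends up one retry short or one over.
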
- Crawl loop materializes downloaded documents. The `download_documents` flag was previously honored only by single-page `scrape()`; the crawl loop now builds `CrawlPageResult.downloaded_document` for linked PDFs/DOCX via a shared helper instead of fetching, flagging, and discarding the bytes.
- SSRF rollout hardening. Follow-up fixes to the SSRF refactor: redirect `final_url` is tracked again (per-hop re-validation moved into `follow_redirects`), within-batch URL dedup no longer races, crawl child-depth is incremented (restoring `max_depth` and `include_paths` semantics), and `CrawlConfig` JSON deserialization honors `CRAWLBERG_ALLOW_PRIVATE_NETWORK` through a `SsrfPolicy::from_env` serde default. Each is covered by a regression test.
- MCP server exposed zero tools. The handler was missing rmcp's `#[tool_handler]`, so `tools/list`/`tools/call` returned an empty list over both stdio and HTTP; it now delegates to the generated tool router.
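
The redirect handling named in the SSRF rollout hardening entry (per-hop re-validation with the final URL tracked) might look roughly like this. It is a sketch under stated assumptions, not the crate's `follow_redirects`: the function name `follow_redirects_sketch`, the `RedirectError` type, and the two closures standing in for the HTTP fetch and the SSRF policy check are all hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum RedirectError {
    /// A hop (including the first URL) failed the policy check.
    Denied(String),
    /// More redirects were offered than the configured bound allows.
    TooManyRedirects,
}

/// Hypothetical sketch. `next_hop` returns the redirect target for a URL, or
/// `None` when the response is final. `is_allowed` stands in for the policy.
fn follow_redirects_sketch(
    start: &str,
    max_redirects: usize,
    next_hop: impl Fn(&str) -> Option<String>,
    is_allowed: impl Fn(&str) -> bool,
) -> Result<String, RedirectError> {
    let mut current = start.to_string();
    let mut hops = 0;
    loop {
        // Re-validate every hop, not just the URL the caller supplied.
        if !is_allowed(&current) {
            return Err(RedirectError::Denied(current));
        }
        match next_hop(&current) {
            // No further redirect: `current` is the final URL.
            None => return Ok(current),
            Some(_) if hops == max_redirects => {
                return Err(RedirectError::TooManyRedirects);
            }
            Some(next) => {
                current = next;
                hops += 1;
            }
        }
    }
}
```

Returning the last validated URL from the same loop that performs the checks is one way to keep the final URL and the validation from drifting apart.
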
- SSRF defense, enabled by default. `scrape()`, `crawl()`, `batch_crawl()`, sitemap fetch, robots.txt fetch, and asset download refuse URLs resolving to loopback (`127.0.0.0/8`), RFC 1918 private networks, link-local (`169.254.0.0/16`, which covers the `169.254.169.254` cloud metadata endpoint), the "this network" range (`0.0.0.0/8`), multicast (`224.0.0.0/4`), IPv6 ULA (`fc00::/7`), IPv6 link-local (`fe80::/10`), IPv6 multicast (`ff00::/8`), or any non-http(s) scheme. Includes DNS-rebinding mitigation (every resolved IP must pass the policy), redirect-chain re-validation (bounded by `ssrf.max_redirects`, default 5), and link-enqueue validation with bounded concurrency. Opt out via `CRAWLBERG_ALLOW_PRIVATE_NETWORK=1` or `CrawlConfig::allow_private_networks(true)`.
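
As a rough illustration of the address ranges listed in the SSRF entry, the check could be expressed with the standard library alone. This is a minimal sketch, not the crate's `SsrfPolicy`: the helper names are hypothetical, scheme checks and the allowlist are omitted, and ranges the entry does not name (for example `::1` or IPv4-mapped IPv6 addresses) are deliberately left out.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Hypothetical helper: true when `ip` falls in a range the entry lists.
fn is_blocked_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_blocked_v4(v4),
        IpAddr::V6(v6) => is_blocked_v6(v6),
    }
}

fn is_blocked_v4(ip: Ipv4Addr) -> bool {
    ip.is_loopback()              // 127.0.0.0/8
        || ip.is_private()        // RFC 1918: 10/8, 172.16/12, 192.168/16
        || ip.is_link_local()     // 169.254.0.0/16
        || ip.octets()[0] == 0    // 0.0.0.0/8
        || ip.is_multicast()      // 224.0.0.0/4
}

fn is_blocked_v6(ip: Ipv6Addr) -> bool {
    let first = ip.segments()[0];
    (first & 0xfe00) == 0xfc00        // fc00::/7 unique local
        || (first & 0xffc0) == 0xfe80 // fe80::/10 link-local
        || ip.is_multicast()          // ff00::/8
}

/// DNS-rebinding mitigation as described: every resolved address must pass.
/// An empty resolution result is treated as a failure (an assumption here).
fn all_resolved_ips_allowed(resolved: &[IpAddr]) -> bool {
    !resolved.is_empty() && resolved.iter().all(|ip| !is_blocked_ip(*ip))
}
```

Requiring every resolved address to pass, rather than any one of them, is what stops a hostname that resolves to both a public and a private address from slipping through.
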
- Bindings, facades, READMEs, docs, stubs, and e2e suites are generated by alef (pinned at 0.26.6) across all 14 language targets.
- Publish-pipeline hardening: a native per-arch Docker matrix that drops QEMU emulation, Flutter-free Dart native builds for pub.dev, Swift artifactbundle checksum injection and Apple system-framework linking, and lockfile-preserving source publishes for the Elixir NIF, PHP extension, and Ruby gem.