Releases: internetarchive/heritrix3
3.13.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
New features
- Config editor: IDE-style completions for bean names and Spring XML (powered by the new bean docs generator). #684
- Job status API: The `sizeTotalsReport` now includes a `sizeOnDisk` value totaling the size of the files in `latest/warcs`. #700
- ExtractorJson: New extractor that extracts URI strings from JSON documents. #701
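As a rough illustration of what a `sizeOnDisk` total represents, the sketch below sums the sizes of the regular files under a directory. The function name `size_on_disk` is hypothetical, and recursing into subdirectories is an assumption. This is not the Heritrix implementation, only an equivalent calculation an operator could run against a job's `latest/warcs` directory.

```python
from pathlib import Path


def size_on_disk(warcs_dir: str) -> int:
    """Return the total size in bytes of all regular files under warcs_dir.

    Hypothetical helper: it mirrors the idea of totaling WARC files on disk
    and is not taken from the Heritrix source.
    """
    root = Path(warcs_dir)
    if not root.is_dir():
        # A job that has not written any WARCs yet has nothing to total.
        return 0
    # Assumption: walk subdirectories too and count every regular file.
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
```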
Bug fixes
- AbstractCookieStore: Fixed cookies whose domain has a leading dot (.example.com) being ignored. #691
- ExtractorHTML: Fixed attribute values longer than 2048 characters causing extraction of truncated strings. #697
- ClientFTP: Fixed MalformedServerReplyException when FTP sends a response with only an error code and no message. #694
- BdbMultipleWorkQueues: Added null checks, type validation, and warning logs in BdbMultipleWorkQueues.delete() to improve frontier stability in the case of corrupted or partially persisted CrawlURIs. #693
- BeanDocProcessor: Fixed compiler IllegalArgumentException when IntelliJ runs the annotation processor with a ProcessingEnvironment wrapper.
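The AbstractCookieStore fix above concerns cookie domains written with a leading dot. Under RFC 6265 a single leading dot in the `Domain` attribute is dropped, so `.example.com` behaves like `example.com`. The sketch below shows that normalization and the resulting domain match. The function names are illustrative, the check that the host is not an IP address is omitted, and none of this is the Heritrix code.

```python
def normalize_cookie_domain(domain: str) -> str:
    """Drop one leading dot and lowercase, as RFC 6265 section 5.2.3 describes."""
    if domain.startswith("."):
        domain = domain[1:]
    return domain.lower()


def domain_matches(host: str, cookie_domain: str) -> bool:
    """Return True if host is the cookie domain or a subdomain of it."""
    host = host.lower()
    domain = normalize_cookie_domain(cookie_domain)
    # Requiring the "." before the suffix keeps "badexample.com" from matching.
    return host == domain or host.endswith("." + domain)
```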
Dependency upgrades
- amqp-client: 5.27.0 → 5.27.1
- commons-cli: 1.10.0 → 1.11.0
- commons-codec: 1.19.0 → 1.20.0
- commons-io: 2.20.0 → 2.21.0
- jackson: 2.20.0 → 2.20.1
- jetty: 12.0.29 → 12.0.30
- jsch: 2.27.4 → 2.27.7
- junit-jupiter: 6.0.0 → 6.0.1
- kafka-clients: 4.1.0 → 4.1.1
- lz4-java: 1.8.0 → 1.10.1
- spring-framework: 6.2.12 → 7.0.1
- webarchive-commons: 3.0.1 → 3.0.2
3.12.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
New features
- ConfigurableExtractorJS: Regex rules to skip extracting `<script>` tags when their attributes match. #672
Bug fixes
- Docs: Switch bean docs generation to an annotation processor, fixing the bean reference broken by Java language changes. #683
- StatisticsTracker: Don’t restore `crawlEndTime` when resuming from a checkpoint. #669
- ExtractorJS: Fix overriding the `strict` setting in sheets. #670
- Berkeley DB: Handle more shutdown interrupts gracefully. #671
Dependency upgrades
- amqp-client: 5.26.0 → 5.27.0
- groovy: 4.0.28 → 5.0.2
- jaxb-runtime: 4.0.5 → 4.0.6
- jetty: 12.0.27 → 12.0.29
- jsch: 2.27.3 → 2.27.4
- junit-jupiter: 5.13.4 → 6.0.0
- kafka-clients: 3.9.1 → 4.1.0
- pdfbox: 3.0.5 → 3.0.6
- rethinkdb-driver: 2.3.3 → 2.4.4
- spring: 6.2.11 → 6.2.12
- webarchive-commons: 3.0.0 → 3.0.1
- webjars-locator-lite: 1.1.0 → 1.1.2
3.11.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
New features
- KnowledgableExtractorJS now extends ConfigurableExtractorJS for its additional options. #668
Bug fixes
- Invalid characters are now stripped from the XML REST API output. Log file truncation after an unclean shutdown can sometimes introduce such characters. #667
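To make the fix above concrete: XML 1.0 only permits tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF. The sketch below removes everything else from a string before it is serialized. The function name is hypothetical and the actual Heritrix filter may differ in detail.

```python
import re

# Negated character class listing every code point XML 1.0 allows.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that cannot appear in an XML 1.0 document."""
    return _INVALID_XML_CHARS.sub("", text)
```

NUL bytes are a typical example of what this removes: a log file truncated by an unclean shutdown can contain runs of them.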
Dependency upgrades
- codemirror@language: 6.11.2 → 6.11.3
- jakarta.xml.bind-api: 4.0.2 → 4.0.4
- jetty: 12.0.25 → 12.0.27
- jsch: 2.27.2 → 2.27.3
- gson: 2.13.1 → 2.13.2
- spring: 6.2.10 → 6.2.11
3.10.2
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
Bug fixes
- AMQPPublishProcessor: The User-Agent string is now included in the metadata so Umbra can use it in its own requests. #663
- FetchDNS: DNS lookups returning `0.0.0.0` are now treated as resolution failure. #665
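The FetchDNS change can be pictured as a small validity check on the resolved address. The sketch below is an illustration only: `is_usable_dns_answer` is a hypothetical name, and rejecting unparseable input is an assumption rather than documented Heritrix behavior.

```python
import ipaddress

# The unspecified IPv4 address, which some resolvers return for blocked hosts.
_UNSPECIFIED_V4 = ipaddress.IPv4Address("0.0.0.0")


def is_usable_dns_answer(address: str) -> bool:
    """Return False for 0.0.0.0 (and for unparseable input), True otherwise."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        # Assumption: anything that is not an IP address counts as a failure.
        return False
    return ip != _UNSPECIFIED_V4
```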
Dependency upgrades
- amqp-client: 5.25.0 → 5.26.0
- codemirror@language: 6.11.1 → 6.11.2
- codemirror@legacy-modes: 6.5.0 → 6.5.1
- codemirror@view: 6.37.2 → 6.38.1
- commons-cli: 1.9.0 → 1.10.0
- commons-codec: 1.18.0 → 1.19.0
- commons-net: 3.11.1 → 3.12.0
- jetty: 12.0.22 → 12.0.25
- junit-jupiter: 5.13.3 → 5.13.4
- groovy: 4.0.27 → 4.0.28
- spring-framework: 6.2.9 → 6.2.10
3.10.1
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
Bug fixes
- FetchHTTP2
  - HTTP/1.1 is now used on servers that don't support ALPN. Fixes `IOException: frame_size_error/invalid_frame_length`.
  - Fixed NullPointerException when the server's IP address isn't available.
- Seeds report: Redirect URIs are now recorded from the `Location` header for HTTP status codes `303 See Other`, `307 Temporary Redirect` and `308 Permanent Redirect`. Previously this was only done for `301 Moved Permanently` and `302 Found`.
- Public suffixes list: A resource naming conflict between webarchive-commons and crawler-commons for `effective_tld_names.dat` was resolved and the list was updated to the latest version.
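The seeds report change amounts to widening the set of status codes whose `Location` header is treated as a redirect target. A minimal sketch, assuming headers arrive as a plain dictionary and that relative targets are resolved against the request URI (both assumptions; `redirect_target` is a hypothetical name, not a Heritrix method):

```python
from typing import Mapping, Optional
from urllib.parse import urljoin

# 301 and 302 were already handled; 303, 307 and 308 are the additions.
REDIRECT_STATUS_CODES = frozenset({301, 302, 303, 307, 308})


def redirect_target(
    request_uri: str, status: int, headers: Mapping[str, str]
) -> Optional[str]:
    """Return the absolute redirect URI, or None if this is not a redirect."""
    if status not in REDIRECT_STATUS_CODES:
        return None
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "location":
            return urljoin(request_uri, value)
    return None
```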
Dependency upgrades
- codemirror@state: 6.4.0 → 6.5.11
- codemirror@view: 6.37.1 → 6.37.2
- commons-lang: 2.6 → 3.18.0
- commons-io: 2.19.0 → 2.20.0
- crawler-commons: 1.4 → 1.5
- jetty: 12.0.17 → 12.0.22
- jsch: 2.27.0 → 2.27.2
- junit-jupiter: 5.13.2 → 5.13.3
- restlet: 2.6.0-rc1 → 2.6.0
- spring: 6.2.7 → 6.2.9
- webarchive-commons: 2.0.1 → 3.0.0
3.10.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
New features
- BrowserProcessor: Loads fetched pages in a local browser (Firefox/ChromeDriver), records all browser requests, and runs pluggable behaviors (e.g. scrolling, link extraction). #653
  - Uses the WebDriver BiDi protocol for browser automation.
  - The recording proxy is built on Jetty's ProxyHandler and the FetchHTTP2 module.
  - Status: Working for small crawls but needs more robust error handling (browser crashes, resource limits).
- Basic web auth: You can now switch the web interface from Digest authentication to Basic authentication with the `--web-auth basic` command-line option. This is useful when running Heritrix behind a reverse proxy that adds external authentication. #654
- Robots.txt wildcards: The `*` and `$` wildcard rules from RFC 9309 are now supported. #656
- FetchHTTP2: Added HTTP proxy support. #657
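For readers unfamiliar with the RFC 9309 wildcards mentioned above: `*` matches any run of characters and a trailing `$` anchors the rule to the end of the path, while a rule without `$` is a prefix match. The sketch below translates one rule into a regular expression. It is an illustration with hypothetical function names, not the Heritrix matcher, and it leaves out percent-encoding normalization and longest-match precedence between rules.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Compile a robots.txt path pattern with * and $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # \Z anchors at the very end; without it the rule is a prefix match.
    return re.compile(body + (r"\Z" if anchored else ""), re.DOTALL)


def path_matches(pattern: str, path: str) -> bool:
    """Return True if the rule pattern applies to the given URL path."""
    return robots_pattern_to_regex(pattern).match(path) is not None
```

For example, `path_matches("/*.php$", "/index.php")` is true, while the same rule does not apply to `/index.php?x=1` because of the end anchor.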
Fixes
- Code editor: The configuration editor and script console were upgraded to CodeMirror 6. This resolves some browser incompatibilities, allowing CodeMirror’s own find function to be re-enabled for reliable text search of content far outside the viewport. #651
- BDB shutdown interrupt handling: The thread’s interrupted flag is now cleared before some BDB interactions to reduce the likelihood of environment invalidation when requestCrawlStop() is called repeatedly. #659
- FetchHTTP2: Fixed gzip alert log messages by configuring HttpClient to not decode gzip encoding from response.
Removals
- Removed Apache HttpClient 3: If you have custom Heritrix modules you may need to update the following class references in your code:

  | Removed | Replacement |
  | --- | --- |
  | `org.apache.commons.httpclient.URIException` | `org.archive.url.URIException` |
  | `org.apache.commons.httpclient.Header` | `org.archive.format.http.HttpHeader` |

  Note that Apache HttpClient 4 (`org.apache.http`) was not removed. #652
Dependency Upgrades
- codemirror: 2.23 → 6
- easymock: 5.5.0 → removed
- groovy: 4.0.26 → 4.0.27
- junit: 5.12.2 → 5.13.1
- kafka-clients: 3.9.0 → 3.9.1
- spring: 6.2.6 → 6.2.7
- webarchive-commons: 1.3.0 → 2.0.1
3.9.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
New features
- FetchHTTP2: Added a new fetch module supporting HTTP/2 and HTTP/3. #649
Fixes
- Fixed HighestUriPrecedenceProvider: Added Histotable serializer and Kryo autoregistration. #647
Changes
- JUnit 5: Upgraded all JUnit 3 and 4 style tests to JUnit 5. #650
Dependency Upgrades
- commons-io: 2.18.0 → 2.19.0
- gson: 2.12.1 → 2.13.1
- jetty: 9.4.19.v20190610 → 12.0.17
- jsch: 0.2.24 → 2.27.0
- junit: 4.13.2 → 5.12.2
- pdfbox: 3.0.4 → 3.0.5
- restlet: 2.5.0 → 2.6.0-RC1
- spring: 6.2.5 → 6.2.6
3.8.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
New Features
- ExtractorYoutubeDL processArguments: New option for overriding the default `yt-dlp` process arguments. #644
Fixes
- Slow tests: Fixed `ObjectIdentityBdbManualCacheTest` so it no longer fails when running tests with `-DrunSlowTests=true`. #643
- Test stability: Disabled `FetchHTTPTest.testHostHeaderDefaultPort` due to sporadic test failures.
- Code cleanup: Fixed some compiler and IDE warnings. Removed unused utility classes (JavaLiterals, LogUtils). #645
Dependency Upgrades
- amqp-client: 5.24.0 → 5.25.0
- beanshell: 2.0b5 → 2.0b6
- commons-codec: 1.17.2 → 1.18.0
- dnsjava: 3.6.2 → 3.6.3
- groovy: 4.0.24 → 4.0.26
- gson: 2.11.0 → 2.12.1
- jsch: 0.2.22 → 0.2.24
- pdfbox: 3.0.3 → 3.0.4
- slf4j: 2.0.16 → 2.0.17
- spring: 6.1.16 → 6.2.5
3.7.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
New Features
- Groovy crawl configs (experimental): Groovy Bean Definition DSL can now be used as an experimental alternative to Spring XML. This enables more terse and human-readable job configuration with inline scripting capabilities. There is no user interface for it in this release. For now, you must manually create a crawler-beans.groovy file in your job directory. #632
- ExtractorHTML obeyRelNofollow: This option skips extraction of links marked `rel=nofollow`. This is useful for avoiding crawler traps on some sites. #638
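The effect of an option like obeyRelNofollow can be sketched with Python's standard HTML parser: anchors whose `rel` attribute contains the `nofollow` token are skipped when the flag is on. The class and function names below are hypothetical, only `<a>` elements are considered, and this is not how ExtractorHTML itself is implemented.

```python
from html.parser import HTMLParser
from typing import List


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, optionally skipping rel=nofollow."""

    def __init__(self, obey_rel_nofollow: bool = True) -> None:
        super().__init__()
        self.obey_rel_nofollow = obey_rel_nofollow
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel is a space-separated, case-insensitive token list.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if self.obey_rel_nofollow and "nofollow" in rel_tokens:
            return
        self.links.append(href)


def extract_links(html: str, obey_rel_nofollow: bool = True) -> List[str]:
    """Return the links found in html, in document order."""
    collector = LinkCollector(obey_rel_nofollow)
    collector.feed(html)
    collector.close()
    return collector.links
```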
Fixes
- Cookie rejected warning: The slf4j change in 3.6.0 inadvertently caused a previously hidden warning to be logged to `job.log` when a server sends a `Set-Cookie` header with a disallowed domain value. This warning is now suppressed since it occurs frequently and does not require any action from the crawl operator. #640
Changes
- Removed fastutil: A small number of usages of fastutil were replaced with standard library equivalents in webarchive-commons and Heritrix. This reduced the Heritrix distribution size from 51 MB to 34 MB. iipc/webarchive-commons#101
Dependency Upgrades
- amqp-client 5.24.0
- commons-codec 1.17.2
- ftpserver-core 1.2.1
- freemarker 2.3.34
- jetty 9.4.57.v20241219
- jsch 0.2.22
- restlet 2.5.0
- spring 6.1.16
- webarchive-commons 1.3.0
3.6.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
Java Compatibility Notice
This release of Heritrix requires Java 17 or later.
New Features
- Automatic Checkpoints on Shutdown: Added `checkpointOnShutdown` option to `CheckpointService` to enable automatic checkpoints if Heritrix is gracefully terminated. #626
- Command-Line Checkpoint Selection: The `--checkpoint` command-line option restarts from a named checkpoint when using the `--run-job` option. #626
- ConfigurableExtractorJS forceStrictIfUrlMatchingRegexList: URLs matching the regular expressions on this list will be processed in strict mode, with only absolute URLs extracted, not relative ones. #624
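The forceStrictIfUrlMatchingRegexList behavior described above can be sketched in two steps: decide whether the page URL matches any listed regular expression, then keep only absolute URLs from the extracted candidates when it does. The function names are hypothetical, and both the full-match semantics and the restriction to `http`/`https` are assumptions, not statements about the Heritrix code.

```python
import re
from typing import Iterable, List
from urllib.parse import urlsplit


def should_force_strict(page_url: str, regex_list: Iterable[str]) -> bool:
    """Return True if page_url matches any pattern (assumed: full match)."""
    return any(re.fullmatch(pattern, page_url) for pattern in regex_list)


def filter_candidates(candidates: Iterable[str], strict: bool) -> List[str]:
    """In strict mode keep only absolute http(s) URLs; otherwise keep all."""
    if not strict:
        return list(candidates)
    kept = []
    for candidate in candidates:
        parts = urlsplit(candidate)
        if parts.scheme in ("http", "https") and parts.netloc:
            kept.append(candidate)
    return kept
```

With a list containing `https?://example\.com/app/.*`, a script at `https://example.com/app/main.js` would be processed strictly, so a candidate such as `/relative/path` is dropped while `https://cdn.example.com/x.js` is kept.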
Changes
- Upgraded to Spring Framework 6.1: The Spring `@Required` annotation has been removed, so it was replaced with a custom implementation to maintain backward compatibility with existing crawl configurations. Spring 6 requires Java 17 so Heritrix does now too. #625
Fixes
- Manifest Hop Priority: Links from sitemaps are now given the same priority as normal navigation links. They were incorrectly being prioritized as transitive hops (embeds). #623
- SLF4J Logging: Heritrix now includes `slf4j-jdk14` to eliminate a startup warning message and fix logging for dependencies (such as crawler-commons) that use SLF4J. Heritrix doesn't use SLF4J itself. #628
Dependency Upgrades
- amqp-client 5.23.0
- commons-cli 1.9.0
- commons-codec 1.17.1
- commons-io 2.18.0
- commons-net 3.11.1
- crawler-commons 1.4
- dnsjava 3.6.2
- easymock 5.5.0
- freemarker 2.3.33
- groovy 4.0.24
- gson 2.11.0
- httpcomponents 4.5.14
- java-socks-proxy-server 4.1.2
- java-websocket removed
- jaxb-runtime 4.0.5
- jsch switched to mwiede fork 0.2.21
- junit 4.13.2
- kafka-clients 3.9.0
- kryo 5.6.2
- pdfbox 3.0.3
- slf4j 2.0.16
- spring-framework 6.1.15
- webarchive-commons 1.2.0