Releases: iipc/jwarc
v0.8.0
New features
- Added accessor methods and toString() to WarcDigest
Bugs fixed
- WarcReader: Cope with channels that aren't actually seekable despite advertising it
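One way a reader could guard against a channel that claims to be seekable but is not (for example, a file channel opened on a pipe) is to probe it before relying on seeking. The following is a minimal sketch using only the JDK; the class and method names are hypothetical and this is not necessarily how WarcReader implements the fix.

```java
import java.io.IOException;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SeekableByteChannel;

class SeekProbe {
    // Returns true only if the channel advertises seeking AND a harmless
    // position round-trip actually succeeds.
    static boolean isReallySeekable(ReadableByteChannel channel) {
        if (!(channel instanceof SeekableByteChannel)) {
            return false;
        }
        try {
            SeekableByteChannel seekable = (SeekableByteChannel) channel;
            // Setting the position to its current value changes nothing on a
            // regular file, but is expected to fail on a non-seekable source.
            seekable.position(seekable.position());
            return true;
        } catch (IOException e) {
            return false; // treat the channel as a plain sequential stream
        }
    }
}
```

A caller could use the result to decide whether record offsets can be read from the channel position or need to be tracked by counting bytes.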
Changes
- Moved network services into a new package
- Split WarcTool into separate files in a new package
v0.7.0
New features
- jwarc now includes a simple filter language for selecting matching WARC records.
jwarc filter 'warc-type != "request"'
jwarc filter ':status == 200 && http:content-type =~ "image/.*"'
long errors = reader.records().filter(WarcFilter.compile(":status >= 400")).count();
- Native binary builds of the jwarc CLI tool are now available for Linux and macOS. These are built using GraalVM and do not require Java to be installed. (The cross-platform .jar is still the recommended version though.)
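To make the meaning of the second filter example above concrete, here is a rough JDK-only stand-in for ':status == 200 && http:content-type =~ "image/.*"'. It does not use jwarc: the record is modelled as a hypothetical map of field names to values, and it assumes that =~ must match the whole field value, which may differ from jwarc's actual behaviour.

```java
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

class FilterSketch {
    // Hypothetical equivalent of:
    //   :status == 200 && http:content-type =~ "image/.*"
    // evaluated over a map that stands in for a WARC record.
    static Predicate<Map<String, String>> imageResponses() {
        Pattern image = Pattern.compile("image/.*");
        return fields -> "200".equals(fields.get(":status"))
                // Assumption: the regex has to match the entire value.
                && image.matcher(fields.getOrDefault("http:content-type", "")).matches();
    }
}
```

Because the result is an ordinary Predicate, it can be passed to Stream.filter in the same way the compiled WarcFilter is in the Java example above.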
Changed
- Calling record.http() no longer invalidates record.body() although care must still be taken.
- Remove the HttpParser.Handler interface
v0.6.0
New features
- WarcServer now partially supports Memento: Link headers, timemap, Accept-Datetime in proxy mode
- WarcServer now indexes the time dimension
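As an illustration of what indexing the time dimension enables, the sketch below resolves a Memento Accept-Datetime header to the nearest capture of a single URL. It uses only the JDK; the class name, the use of a TreeMap, and the string "location" values are assumptions made for the example and do not describe WarcServer's internal index.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.TreeMap;

class TimeIndexSketch {
    // Hypothetical index for one URL: capture time -> record location.
    private final TreeMap<Instant, String> captures = new TreeMap<>();

    void add(Instant time, String location) {
        captures.put(time, location);
    }

    // Parses an Accept-Datetime value (an HTTP-date such as
    // "Thu, 31 May 2007 20:35:00 GMT") and returns the location of the
    // capture closest in time, or null if nothing has been indexed.
    String closest(String acceptDatetime) {
        Instant wanted = ZonedDateTime
                .parse(acceptDatetime, DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant();
        Instant before = captures.floorKey(wanted);   // latest capture <= wanted
        Instant after = captures.ceilingKey(wanted);  // earliest capture >= wanted
        if (before == null && after == null) return null;
        if (before == null) return captures.get(after);
        if (after == null) return captures.get(before);
        Duration toBefore = Duration.between(before, wanted);
        Duration toAfter = Duration.between(wanted, after);
        // Ties go to the earlier capture.
        return captures.get(toBefore.compareTo(toAfter) <= 0 ? before : after);
    }
}
```

A sorted map keeps both neighbours of the requested time one lookup away, which is also what a timemap listing needs in order to enumerate captures chronologically.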
Bug fixes
- HttpServer was improperly parsing subsequent requests on a keep-alive connection.
Changes
- Write WARC/1.0 by default for compatibility with older tools
v0.5.0
New features
- WarcTool now includes a recorder command which runs a recording proxy server.
- WarcTool now includes a record command which captures a page using headless Chrome.
- MessageBody now supports random access (a variant implementing SeekableByteChannel is returned) when the underlying channel does and there's no chunked encoding or compression.
- HttpResponse now handles chunked encoding.
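For readers unfamiliar with chunked transfer encoding, the following sketch decodes a complete chunked body using only the JDK (Java 11 or later for readNBytes). It is a simplified illustration rather than jwarc's parser: the class and method names are hypothetical, chunk extensions are discarded, trailers are not read, and malformed input surfaces as an exception.

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class ChunkedSketch {
    // Decodes a chunked body: each chunk is "<hex size>[;ext]CRLF<data>CRLF",
    // terminated by a chunk of size 0.
    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (true) {
            String sizeLine = readLine(in);
            int semicolon = sizeLine.indexOf(';');
            if (semicolon >= 0) {
                sizeLine = sizeLine.substring(0, semicolon); // drop chunk extensions
            }
            int size = Integer.parseInt(sizeLine.trim(), 16);
            if (size == 0) {
                break; // last chunk; any trailers are left unread
            }
            byte[] chunk = in.readNBytes(size);
            if (chunk.length < size) {
                throw new EOFException("body ended inside a chunk");
            }
            out.write(chunk);
            readLine(in); // consume the CRLF that follows the chunk data
        }
        return out.toByteArray();
    }

    // Reads up to and including the next LF, returning the line without CR/LF.
    private static String readLine(InputStream in) throws IOException {
        StringBuilder line = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') {
                line.append((char) b);
            }
        }
        return line.toString();
    }
}
```

Since the decoded length is only known after all chunks have been read, a chunked body cannot be addressed by byte offset without decoding it first, which fits with the random-access restriction noted for MessageBody above.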
Bug fixes
- Percent (%) is now accepted by the parser in URLs (HTTP requests, ARC files).
- HttpResponse was calculating the body length incorrectly causing it to be truncated.
- Problem headers are stripped during replay.
- The rewriting JavaScript is no longer injected in proxy mode.
WarcServer and WarcRecorder are still highly experimental and so are currently only available through the command-line tool. I intend to give them a public API in a future version of jwarc. If you'd like to use them in the meantime, please treat them as examples: copy, paste and modify them to suit your needs.
v0.4.0
New features
- Added a screenshot command that uses headless Chrome to screenshot each page in a WARC.
- The replay server can now replay HTTPS resources in proxy mode by generating certificates on the fly.
Bug fixes
- The cdx command was corrected to use the compressed record length.
Changes
- WarcWriter.fetch now makes an HTTP/1.0 request to discourage chunked encoding (#2)
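Chunked transfer encoding is an HTTP/1.1 feature, so a request that declares HTTP/1.0 signals to the server that the client cannot be expected to decode it. The sketch below builds such a request head with the JDK only. The class name and the exact header set (Host, Connection: close) are assumptions made for illustration and are not taken from WarcWriter.fetch.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

class Http10RequestSketch {
    // Builds the bytes of a minimal HTTP/1.0 GET request for the given URI.
    static byte[] build(URI uri) {
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // an empty path is sent as the root
        }
        if (uri.getRawQuery() != null) {
            path += "?" + uri.getRawQuery();
        }
        String host = uri.getPort() == -1
                ? uri.getHost()
                : uri.getHost() + ":" + uri.getPort();
        String head = "GET " + path + " HTTP/1.0\r\n"
                + "Host: " + host + "\r\n"       // assumed header, not required by HTTP/1.0
                + "Connection: close\r\n"        // assumed header
                + "\r\n";
        return head.getBytes(StandardCharsets.US_ASCII);
    }
}
```

With an HTTP/1.0 request the end of the response body is normally signalled by Content-Length or by the server closing the connection, so the recorded payload can be stored without a decoding step.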
v0.3.0
Features
- Added a primitive replay server to WarcTool
- Added payloadType() to capture records
- Added ipAddress() to capture record builders
Bugs
- Fixed calculation bugs that were causing large message bodies to be misread
- WARC-Target-URIs are now parsed more leniently to better cope with files in the wild