
Releases: iipc/jwarc

v0.8.0

15 Jan 15:33

New features

  • Added accessor methods and toString() to WarcDigest

Bugs fixed

  • WarcReader: Cope with channels that advertise themselves as seekable but cannot actually seek
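As an illustration of the kind of situation this fix addresses, here is a minimal sketch of a defensive seekability probe. It is not jwarc's actual implementation; the class name SeekProbe and the helper positionOrMinusOne are hypothetical names chosen for this example. The idea is that a channel's type alone cannot be trusted: a channel may implement SeekableByteChannel yet throw when asked for its position (a file channel opened on a pipe is one case where this may happen).

```java
import java.io.IOException;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SeekableByteChannel;

// Hypothetical helper, not part of jwarc.
final class SeekProbe {
    /**
     * Returns the channel's current position, or -1 if the channel
     * either does not claim to be seekable or fails when probed.
     */
    static long positionOrMinusOne(ReadableByteChannel channel) {
        if (!(channel instanceof SeekableByteChannel)) {
            return -1; // the type does not even advertise seeking
        }
        try {
            return ((SeekableByteChannel) channel).position();
        } catch (IOException e) {
            // The type says "seekable" but the underlying resource disagrees.
            return -1;
        }
    }
}
```

A caller could treat -1 as "fall back to sequential reading" rather than failing outright.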

Changes

  • Moved network services into a new package
  • Split WarcTool into separate files in a new package

v0.7.0

11 Mar 04:07
@ato

New features

  • jwarc now includes a simple filter language for selecting matching WARC records.
    • jwarc filter 'warc-type != "request"'
    • jwarc filter ':status == 200 && http:content-type =~ "image/.*"'
    • long errors = reader.records().filter(WarcFilter.compile(":status >= 400")).count();
  • Native binary builds of the jwarc CLI tool are now available for Linux and macOS. These are built using GraalVM and do not require Java to be installed. (The cross-platform .jar is still the recommended version, though.)

Changed

  • Calling record.http() no longer invalidates record.body(), although care must still be taken.
  • Removed the HttpParser.Handler interface

v0.6.0

22 Feb 00:15
@ato

New features

  • WarcServer now partially supports Memento: Link headers, timemap, Accept-Datetime in proxy mode
  • WarcServer now indexes the time dimension
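To make the Accept-Datetime part concrete, here is a small sketch of how a datetime negotiation step could work. The release notes do not describe how WarcServer selects a capture, so the "closest in time" rule below is an assumption, and AcceptDatetimeSketch, parse and closest are hypothetical names. What is standard is the header format: Memento's Accept-Datetime carries an RFC 1123 date in GMT, which java.time can parse directly.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;

// Hypothetical sketch, not jwarc's WarcServer code.
final class AcceptDatetimeSketch {
    /** Parses an Accept-Datetime value such as "Tue, 11 Sep 2001 20:35:00 GMT". */
    static Instant parse(String headerValue) {
        return ZonedDateTime.parse(headerValue, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
    }

    /** Returns the capture time nearest to the requested time, or null if there are none. */
    static Instant closest(List<Instant> captures, Instant requested) {
        Instant best = null;
        for (Instant candidate : captures) {
            if (best == null
                    || Duration.between(candidate, requested).abs()
                           .compareTo(Duration.between(best, requested).abs()) < 0) {
                best = candidate;
            }
        }
        return best;
    }
}
```

An index keyed on both URL and capture time is what makes a lookup like closest possible for a given resource.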

Bug fixes

  • HttpServer was improperly parsing subsequent requests on a keep-alive connection.

Changes

  • Write WARC/1.0 by default for compatibility with older tools

v0.5.0

18 Feb 09:22
@ato

New features

  • WarcTool now includes a recorder command which runs a recording proxy server.
  • WarcTool now includes a record command which captures a page using headless Chrome.
  • MessageBody now supports random access (a variant implementing SeekableByteChannel is returned) when the underlying channel does and there's no chunked encoding or compression.
  • HttpResponse now handles chunked encoding.
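For readers unfamiliar with chunked encoding, the following is a minimal standalone decoder showing what "handling" it involves: each chunk is a hexadecimal size line, CRLF, that many bytes of data, and another CRLF, with a zero-size chunk marking the end. This is an illustrative sketch only, not the code in jwarc's HttpResponse; ChunkedSketch and decode are hypothetical names, and the sketch ignores trailers and works on an in-memory byte array for simplicity.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not jwarc's parser.
final class ChunkedSketch {
    /** Decodes an HTTP/1.1 chunked body held entirely in memory. */
    static byte[] decode(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pos = 0;
        while (true) {
            int eol = indexOfCrlf(raw, pos);
            if (eol < 0) throw new IllegalArgumentException("missing chunk-size line");
            String sizeLine = new String(raw, pos, eol - pos, StandardCharsets.US_ASCII);
            int semi = sizeLine.indexOf(';');
            if (semi >= 0) sizeLine = sizeLine.substring(0, semi); // drop chunk extensions
            int size = Integer.parseInt(sizeLine.trim(), 16);      // sizes are hexadecimal
            if (size < 0) throw new IllegalArgumentException("negative chunk size");
            pos = eol + 2;                                          // step past the size line's CRLF
            if (size == 0) break;                                   // last chunk; trailers are ignored here
            if (pos + size + 2 > raw.length) throw new IllegalArgumentException("truncated chunk");
            out.write(raw, pos, size);
            pos += size + 2;                                        // skip the data and its trailing CRLF
        }
        return out.toByteArray();
    }

    private static int indexOfCrlf(byte[] raw, int from) {
        for (int i = from; i + 1 < raw.length; i++) {
            if (raw[i] == '\r' && raw[i + 1] == '\n') return i;
        }
        return -1;
    }
}
```

Because the decoded length is only known after walking every chunk, a chunked body cannot be mapped to byte offsets in the underlying channel, which is consistent with random access being offered only when there is no chunked encoding or compression.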

Bug fixes

  • Percent (%) is now accepted by the parser in URLs (HTTP requests, ARC files).
  • HttpResponse was calculating the body length incorrectly causing it to be truncated.
  • Problematic headers are now stripped during replay.
  • The rewriting JavaScript is no longer injected in proxy mode.

WarcServer and WarcRecorder are still highly experimental and so are currently only available through the command-line tool. I intend to give them a public API in a future version of jwarc. If you'd like to use them in the meantime, please treat them as examples: copy, paste and modify them to suit your needs.

v0.4.0

14 Feb 12:57
@ato

New features

  • Added a screenshot command that uses headless Chrome to screenshot each page in a WARC.
  • The replay server can now replay HTTPS resources in proxy mode by generating certificates on the fly.

Bug fixes

  • The cdx command was corrected to use the compressed record length.

Changes

  • WarcWriter.fetch now makes an HTTP/1.0 request to discourage chunked encoding (#2)
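The reasoning behind this change is that chunked transfer coding was introduced in HTTP/1.1, so a server should not send a chunked response to a client that announces HTTP/1.0. The sketch below shows what such a request looks like on the wire. It is not the code inside WarcWriter.fetch; Http10Request and build are hypothetical names, and the choice to send only a Host header is an assumption made for brevity.

```java
import java.net.URI;

// Hypothetical sketch, not jwarc's fetch implementation.
final class Http10Request {
    /** Builds the text of a minimal HTTP/1.0 GET request for the given URI. */
    static String build(URI uri) {
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) path = "/";       // the request target must not be empty
        if (uri.getRawQuery() != null) path += "?" + uri.getRawQuery();

        String host = uri.getHost();
        if (uri.getPort() != -1) host += ":" + uri.getPort(); // keep an explicit port if one was given

        // Host is optional in HTTP/1.0 but widely expected by virtual-hosted servers.
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "\r\n";
    }
}
```

For example, build(URI.create("http://example.org/a?b=1")) produces a request line of "GET /a?b=1 HTTP/1.0".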

v0.3.0

10 Feb 12:17
@ato

Features

  • Added a primitive replay server to WarcTool
  • Added payloadType() to capture records
  • Added ipAddress() to capture record builders

Bugs

  • Fixed calculation bugs that were causing large message bodies to be misread
  • WARC-Target-URIs are now parsed more leniently to better cope with files in the wild

v0.2.0

09 Feb 06:09
@ato

New features:

  • WarcWriter.fetch(uri) downloads a remote resource
  • The jwarc jar is now executable and includes a basic CLI for some functions

Bugs fixed:

  • WarcWriter now correctly writes CRLFCRLF as the record trailer rather than LFLF
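To show where the trailer sits, here is a minimal sketch of WARC record serialization: a version line, header fields, a blank line, the content block, and then two CRLF pairs closing the record. This is an illustration rather than WarcWriter's actual code; RecordTrailerSketch and serialize are hypothetical names, and a real record would carry more mandatory headers (such as WARC-Record-ID and WARC-Date) than this example does.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical sketch, not jwarc's WarcWriter.
final class RecordTrailerSketch {
    private static final byte[] CRLF_CRLF = {'\r', '\n', '\r', '\n'};

    /** Serializes headers and a body into one WARC/1.0 record, including the trailer. */
    static byte[] serialize(Map<String, String> headers, byte[] body) {
        StringBuilder head = new StringBuilder("WARC/1.0\r\n");
        for (Map.Entry<String, String> field : headers.entrySet()) {
            head.append(field.getKey()).append(": ").append(field.getValue()).append("\r\n");
        }
        head.append("Content-Length: ").append(body.length).append("\r\n");
        head.append("\r\n"); // blank line separating headers from the content block

        byte[] headBytes = head.toString().getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headBytes, 0, headBytes.length);
        out.write(body, 0, body.length);
        out.write(CRLF_CRLF, 0, CRLF_CRLF.length); // record trailer: CRLF CRLF, not LF LF
        return out.toByteArray();
    }
}
```

Readers that follow the format strictly expect exactly these four bytes after the content block, so a bare LF LF trailer can cause them to misparse the next record.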