Following #299
- Name fields so the common ones are consistent with the CDXJ specification, i.e. make this an extension of CDXJ.
- Document usage, noting that dynamic content can't be easily extracted because it's dynamic.
- Store `content_first_bytes` without spaces? Store `content_ffb` as well or instead of `content_first_bytes`?
- Consider allowing payload inclusion if small, e.g. smaller HTML files or initial binary chunk.
- Consider extending the API so consumers can use the reference (name/offset) to get the payload InputStream.
- Include `warcinfo` records in the JSONL output (currently skipped by the `windex.extract`).
- Should `boiler_pipe` extraction be used?
- Should extracted links be normalised?
- Should image and/or PDF analysis be enabled?
- Should the original payload be included if small enough? Or just for text?
- Should there be an option to output only the term frequency or collocation statistics of the text, so we can do this for everything? Perhaps that's better as a post-processing step?
- Both the Tika configuration `extract_all_metadata` and the experimental WARCStats code show there are lots of other metadata fields that might be of interest. These could be stored in some kind of hash, but note that Parquet/Avro schema reflection does not support hashes directly. The `MementoRecord` class illustrates that the `Memento` bean could be implemented on top of an extensible hash-map, which might make dynamic Parquet schema generation possible.
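The last point above, a record backed by an extensible hash-map from which a column list is derived, could look roughly like the sketch below. It is written in Python purely for illustration and is not the project's `MementoRecord` code: `HashBackedRecord`, `CORE_FIELDS`, `derive_columns` and `to_row` are all hypothetical names, and the core field list is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List

# Hypothetical core fields; the real Memento bean defines its own set.
CORE_FIELDS = ("url", "timestamp", "content_type")


@dataclass
class HashBackedRecord:
    """Illustrative stand-in for a record stored in an extensible hash-map."""
    values: Dict[str, Any] = field(default_factory=dict)

    def put(self, name: str, value: Any) -> None:
        self.values[name] = value

    def get(self, name: str, default: Any = None) -> Any:
        return self.values.get(name, default)


def derive_columns(records: Iterable[HashBackedRecord]) -> List[str]:
    """Return the core fields first, then every extra key seen, sorted by name."""
    extras = set()
    for record in records:
        extras.update(k for k in record.values if k not in CORE_FIELDS)
    return list(CORE_FIELDS) + sorted(extras)


def to_row(record: HashBackedRecord, columns: List[str]) -> List[Any]:
    """Flatten a record to a fixed-width row; missing fields become None."""
    return [record.get(name) for name in columns]
```

Deriving the column list from the keys actually present in a batch is one way to avoid relying on schema reflection over a map-typed field. Whether this translates cleanly to Parquet schema generation in the real code base is an open question.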
Example WARC Stats code output:
INFO WARCStatsToolIntegrationTest - {"timestamp":"20080430204830","url":"http:\/\/www.archive.org\/services\/collection-rss.php","source-file":"hdfs:\/\/localhost:58536\/user\/anj\/inputs\/IAH-20080430204825-00000-blackbook-truncated.warc.gz","content-type":"application\/http; msgtype=response","content-length":"50832","length":"50831","source-offset":"18283","HEADER-reader-identifier":"IAH-20080430204825-00000-blackbook-truncated.warc.gz","HEADER-WARC-Payload-Digest":"sha1:JXXJNHJX4GEM44C4NOM3RJWKMKVBIGHF","HEADER-WARC-IP-Address":"207.241.229.39","HEADER-absolute-offset":"18283","HEADER-WARC-Target-URI":"http:\/\/www.archive.org\/services\/collection-rss.php","HEADER-WARC-Date":"2008-04-30T20:48:30Z","HEADER-Content-Length":"50832","HEADER-WARC-Record-ID":"<urn:uuid:8399ab93-1fee-4787-aa60-0f1ce83cb885>","HEADER-WARC-Type":"response","HEADER-Content-Type":"application\/http; msgtype=response","record-type":"warc.response","digest":"sha1:JXXJNHJX4GEM44C4NOM3RJWKMKVBIGHF","status-code":"200","HTTP-Date":"Wed, 30 Apr 2008 20:48:29 GMT","HTTP-Server":"Apache\/2.0.54 (Ubuntu) PHP\/5.0.5-2ubuntu1.4 mod_ssl\/2.0.54 OpenSSL\/0.9.7g","HTTP-X-Powered-By":"PHP\/5.0.5-2ubuntu1.4","HTTP-Connection":"close","HTTP-Content-Type":"text\/xml","host":"www.archive.org","year":"2008"}
INFO WARCStatsToolIntegrationTest - {"timestamp":"20080430204825","url":"http:\/\/www.archive.org\/robots.txt","source-file":"hdfs:\/\/localhost:58536\/user\/anj\/inputs\/IAH-20080430204825-00000-blackbook-truncated.warc.gz","content-type":"application\/http; msgtype=response","content-length":"782","length":"781","source-offset":"707","HEADER-reader-identifier":"IAH-20080430204825-00000-blackbook-truncated.warc.gz","HEADER-WARC-Payload-Digest":"sha1:SUCGMUVXDKVB5CS2NL4R4JABNX7K466U","HEADER-WARC-IP-Address":"207.241.229.39","HEADER-absolute-offset":"707","HEADER-WARC-Target-URI":"http:\/\/www.archive.org\/robots.txt","HEADER-WARC-Date":"2008-04-30T20:48:25Z","HEADER-Content-Length":"782","HEADER-WARC-Record-ID":"<urn:uuid:e7c9eff8-f5bc-4aeb-b3d2-9d3df99afb30>","HEADER-WARC-Type":"response","HEADER-Content-Type":"application\/http; msgtype=response","record-type":"warc.response","digest":"sha1:SUCGMUVXDKVB5CS2NL4R4JABNX7K466U","status-code":"200","HTTP-Date":"Wed, 30 Apr 2008 20:48:24 GMT","HTTP-Server":"Apache\/2.0.54 (Ubuntu) PHP\/5.0.5-2ubuntu1.4 mod_ssl\/2.0.54 OpenSSL\/0.9.7g","HTTP-Last-Modified":"Sat, 02 Feb 2008 19:40:44 GMT","HTTP-ETag":"\"47c3-1d3-11134700\"","HTTP-Accept-Ranges":"bytes","HTTP-Content-Length":"467","HTTP-Connection":"close","HTTP-Content-Type":"text\/plain; charset=UTF-8","host":"www.archive.org","year":"2008"}
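To get a feel for how many extra fields such output carries, a consumer could split each line into core fields, WARC headers and HTTP headers. The sketch below assumes the `INFO WARCStatsToolIntegrationTest - ` logger prefix and the `HEADER-` / `HTTP-` key prefixes seen in the sample lines above; the function names and the group names `core`, `warc_header` and `http_header` are hypothetical.

```python
import json
from typing import Any, Dict

# Logger prefix as it appears in the sample output above.
LOG_PREFIX = "INFO WARCStatsToolIntegrationTest - "


def parse_stats_line(line: str) -> Dict[str, Any]:
    """Strip the logger prefix (if present) and decode the JSON object."""
    if line.startswith(LOG_PREFIX):
        line = line[len(LOG_PREFIX):]
    return json.loads(line)


def group_by_prefix(record: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Split a flat record into core, WARC-header and HTTP-header groups."""
    grouped: Dict[str, Dict[str, Any]] = {
        "core": {},
        "warc_header": {},
        "http_header": {},
    }
    for key, value in record.items():
        if key.startswith("HEADER-"):
            grouped["warc_header"][key[len("HEADER-"):]] = value
        elif key.startswith("HTTP-"):
            grouped["http_header"][key[len("HTTP-"):]] = value
        else:
            grouped["core"][key] = value
    return grouped
```

The `\/` sequences in the sample lines are valid JSON escapes, so the standard `json` module decodes them to plain `/` without any extra handling.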