Releases: centic9/CommonCrawlDocumentDownload
1.0.0.11
- Update to latest crawl and disable throttling, seems not necessary cu…
- Don't add matches twice to the file with matching mimetypes/extensions
- Add note about missing backoff and link to commoncrawl-fetcher-lite
- Update GitHub Action
- Update to JDK 17
- Migrate to JUnit 5
- Migrate to Apache HttpClient 5
- Update third-party libraries
Full Changelog: 1.0.0.10...1.0.0.11
1.0.0.9
- Switch to Gradle 7.6 and to the new maven-publish plugin
- Update third-party libraries
- Update to a more recent CC-MAIN crawl
- Parse newer fields
- Adjust logging configuration
Full Changelog: 1.0.0.8...1.0.0.9
1.0.0.8
Intermediate release while switching to Gradle 7.6, not uploaded to Maven Central.
Full Changelog: 1.0.0.7...1.0.0.8
1.0.0.10
- Re-publish with correct artifactId
Full Changelog: 1.0.0.9...1.0.0.10
1.0.0.7
- Add extension .pot for PowerPoint
- Switch to CC-MAIN-2019-39
- Update third-party libraries
Full Changelog: 1.0.0.6...1.0.0.7
1.0.0.6
- Update third-party libraries
- Use Common Crawl 2018-43 by default
- Write accumulated mimetypes to a separate text file after each index file
- Add some support for detecting duplicate files and moving them out of the list, so that the post-processing steps do not re-process the same file over and over
- Some small adjustments for behavior changes in Java 11
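
The duplicate handling mentioned in the 1.0.0.6 notes can be pictured roughly as follows. This is only a minimal sketch and not the project's actual code: it assumes duplicates are identified by a SHA-256 hash of the file content, and the class name `DuplicateMover`, the method `moveDuplicates` and the separate duplicates directory are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of CommonCrawlDocumentDownload
class DuplicateMover {

    /** Returns the hex-encoded SHA-256 digest of the file's content. */
    static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), digest)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // reading through the stream feeds the bytes into the digest
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /**
     * Keeps the first file (in sorted name order) for each distinct content
     * hash and moves every later file with the same hash to duplicatesDir.
     * Returns the new locations of the moved files.
     */
    static List<Path> moveDuplicates(Path sourceDir, Path duplicatesDir)
            throws IOException, NoSuchAlgorithmException {
        Files.createDirectories(duplicatesDir);

        List<Path> files;
        try (Stream<Path> stream = Files.list(sourceDir)) {
            files = stream.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }

        Map<String, Path> seen = new HashMap<>();
        List<Path> moved = new ArrayList<>();
        for (Path file : files) {
            String hash = sha256(file);
            // putIfAbsent returns the earlier file if this hash was already seen
            if (seen.putIfAbsent(hash, file) != null) {
                Path target = duplicatesDir.resolve(file.getFileName());
                Files.move(file, target, StandardCopyOption.REPLACE_EXISTING);
                moved.add(target);
            }
        }
        return moved;
    }
}
```

Calling `DuplicateMover.moveDuplicates(downloadDir, duplicatesDir)` before any post-processing would leave one copy of each distinct document in `downloadDir`. Hashing the full content is simple but reads every file once; comparing file sizes first would be a possible optimization.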