
Commit e5d122a

feat: add --crawl (#39)
1 parent 83d1f31 commit e5d122a

File tree

10 files changed: +276 -78 lines

.github/workflows/ci.yaml

+1 -1

@@ -14,7 +14,7 @@ jobs:
     runs-on: ${{ matrix.os }}
     strategy:
       fail-fast: false
-      #max-parallel: 1
+      max-parallel: 1 # avoids ever triggering a rate limit
       matrix:
         python-version: ['3.7', '3.8', '3.9', '3.10', '3.11', '3.12']
         os: [ubuntu-latest]

CHANGELOG.md

+4 -1

@@ -1,7 +1,10 @@
+- 0.9.37
+  + --crawl for CCF
+
 - 0.9.36
   + ratelimit code; both IA and CCF are rate limiting their cdx endpoints
   + cache collinfo.json in ~/.cache/cdx_toolkit/
-  + py3.11 and py3.12 pass testing
+  + py3.11 and py3.12 pass testing; windows and macos pass testing
 
 - 0.9.35
   + exponential backoff retries now that IA is sending 429

Makefile

+3 -4

@@ -33,14 +33,13 @@ distcheck: distclean
 	twine check dist/*
 
 dist: distclean
-	echo " Finishe CHANGELOG and commit it.
+	echo " Finishe CHANGELOG.md and commit it."
 	echo " git tag --list"
-	echo " git tag v0.x.x"
+	echo " git tag 0.x.x # no v"
 	echo " git push --tags"
 	python ./setup.py sdist
 	twine check dist/*
 	twine upload dist/* -r pypi
 
 install:
-	python ./setup.py install
-
+	pip install .

README.md

+76 -21

@@ -3,61 +3,118 @@
 [![build](https://github.com/cocrawler/cdx_toolkit/actions/workflows/ci.yaml/badge.svg)](https://github.com/cocrawler/cdx_toolkit/actions/workflows/ci.yaml) [![coverage](https://codecov.io/gh/cocrawler/cdx_toolkit/graph/badge.svg?token=M1YJB998LE)](https://codecov.io/gh/cocrawler/cdx_toolkit) [![Apache License 2.0](https://img.shields.io/github/license/cocrawler/cdx_toolkit.svg)](LICENSE)
 
 cdx_toolkit is a set of tools for working with CDX indices of web
-crawls and archives, including those at CommonCrawl and the Internet
-Archive's Wayback Machine.
+crawls and archives, including those at the Common Crawl Foundation
+(CCF) and those at the Internet Archive's Wayback Machine.
 
-CommonCrawl uses Ilya Kreymer's pywb to serve the CDX API, which is
-somewhat different from the Internet Archive's CDX API server. cdx_toolkit
-hides these differences as best it can. cdx_toolkit also knits
-together the monthly Common Crawl CDX indices into a single, virtual
-index.
+Common Crawl uses Ilya Kreymer's pywb to serve the CDX API, which is
+somewhat different from the Internet Archive's CDX API server.
+cdx_toolkit hides these differences as best it can. cdx_toolkit also
+knits together the monthly Common Crawl CDX indices into a single,
+virtual index.
 
 Finally, cdx_toolkit allows extracting archived pages from CC and IA
-into WARC files. If you're looking to create subsets of CC or IA data
-and then process them into WET or WAT files, this is a feature you'll
-find useful.
+into WARC files. If you're looking to create subsets of CC or IA data
+and then further process them, this is a feature you'll find useful.
 
 ## Installing
 
-cdx toolkit requires Python 3.
-
 ```
 $ pip install cdx_toolkit
 ```
 
-or clone this repo and use `python ./setup.py install`.
+or clone this repo and use `pip install .`
 
 ## Command-line tools
 
 ```
 $ cdxt --cc size 'commoncrawl.org/*'
-$ cdxt --cc --limit 10 iter 'commoncrawl.org/*'
+$ cdxt --cc --limit 10 iter 'commoncrawl.org/*' # returns the most recent year
+$ cdxt --crawl 3 --limit 10 iter 'commoncrawl.org/*' # returns the most recent 3 crawls
 $ cdxt --cc --limit 10 --filter '=status:200' iter 'commoncrawl.org/*'
-$ cdxt --ia --limit 10 iter 'commoncrawl.org/*'
+
+$ cdxt --ia --limit 10 iter 'commoncrawl.org/*' # will show the beginning of IA's crawl
 $ cdxt --ia --limit 10 warc 'commoncrawl.org/*'
 ```
 
 cdxt takes a large number of command line switches, controlling
 the time period and all other CDX query options. cdxt can generate
 WARC, jsonl, and csv outputs.
 
-** Note that by default, cdxt --cc will iterate over the previous
-year of captures. **
+If you don't specify much about the crawls or dates or number of
+records you're interested in, some default limits will kick in to
+prevent overly-large queries. These default limits include a maximum
+of 1000 records (`--limit 1000`) and a limit of 1 year of CC indexes.
+To exceed these limits, use `--limit` and `--crawl` or `--from` and
+`--to`.
+
+If it seems like nothing is happening, add `-v` or `-vv` at the start:
+
+```
+$ cdxt -vv --cc size 'commoncrawl.org/*'
+```
+
+## Selecting particular CCF crawls
+
+Common Crawl's data is divided into "crawls", which were yearly at the
+start, and are currently done monthly. There are over 100 of them.
+[You can find details about these crawls here.](https://data.commoncrawl.org/crawl-data/index.html)
+
+Unlike some web archives, CCF doesn't have a single CDX index that
+covers all of these crawls -- we have 1 index per crawl. The way
+you ask for a particular crawl is:
+
+```
+$ cdxt --crawl CC-MAIN-2024-33 iter 'commoncrawl.org/*'
+```
+
+- `--crawl CC-MAIN-2024-33` is a single crawl.
+- `--crawl 3` is the latest 3 crawls.
+- `--crawl CC-MAIN-2018` will match all of the crawls from 2018.
+- `--crawl CC-MAIN-2018,CC-MAIN-2019` will match all of the crawls from 2018 and 2019.
+
+CCF also has a hive-sharded parquet index (called the columnar index)
+that covers all of our crawls. Querying broad time ranges is much
+faster with the columnar index. You can find more information about
+this index at [the blog post about it](https://commoncrawl.org/blog/index-to-warc-files-and-urls-in-columnar-format).
+
+The Internet Archive cdx index is organized as a single crawl that goes
+from the very beginning until now. That's why there is no `--crawl` for
+`--ia`. Note that cdx queries to `--ia` will default to one year
+and limit 1000 entries if you do not specify `--from`, `--to`, and `--limit`.
+
+## Selecting by time
+
+In most cases you'll probably use --crawl to select the time range for
+Common Crawl queries, but for the Internet Archive you'll need to specify
+a time range like this:
+
+```
+$ cdxt --ia --limit 1 --from 2008 --to 200906302359 iter 'commoncrawl.org/*'
+```
+
+In this example the time range starts at the beginning of 2008 and
+ends on June 30, 2009 at 23:59. All times are in UTC. If you do not
+specify a time range (and also don't use `--crawl`), you'll get the
+most recent year.
 
-See
+## The full syntax for command-line tools
 
 ```
 $ cdxt --help
 $ cdxt iter --help
 $ cdxt warc --help
+$ cdxt size --help
 ```
 
 for full details. Note that argument order really matters; each switch
 is valid only either before or after the {iter,warc,size} command.
 
 Add -v (or -vv) to see what's going on under the hood.
 
-## Programming example
+## Python programming example
+
+Everything that you can do on the command line, and much more, can
+be done by writing a Python program.
 
 ```
 import cdx_toolkit
@@ -231,5 +288,3 @@ distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-
-
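The new README text above shows `--from`/`--to` time selection only via the CLI. As a minimal Python sketch of the same query, assuming the `from_ts`, `to`, and `limit` keyword parameters that `cdx_toolkit/__init__.py` checks in the diff below (the printed fields are illustrative):

```
import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='ia')

# equivalent of: cdxt --ia --limit 1 --from 2008 --to 200906302359 iter 'commoncrawl.org/*'
# from_ts= and to= take the same UTC timestamp prefixes as --from and --to
for obj in cdx.iter('commoncrawl.org/*', from_ts='2008', to='200906302359', limit=1):
    print(obj['timestamp'], obj['url'])
```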

cdx_toolkit/__init__.py

+13 -6

@@ -197,12 +197,14 @@ def __next__(self):
             LOGGER.debug('getting more in __next__')
             self.get_more()
             if len(self.captures) <= 0:
+                # XXX print out a warning if this hits the default limit of 1000
                 raise StopIteration
 
 
 class CDXFetcher:
-    def __init__(self, source='cc', wb=None, warc_download_prefix=None, cc_mirror=None, cc_sort='mixed', loglevel=None):
+    def __init__(self, source='cc', crawl=None, wb=None, warc_download_prefix=None, cc_mirror=None, cc_sort='mixed', loglevel=None):
         self.source = source
+        self.crawl = crawl
         self.cc_sort = cc_sort
         self.source = source
         if wb is not None and warc_download_prefix is not None:

@@ -211,12 +213,11 @@ def __init__(self, source='cc', wb=None, warc_download_prefix=None, cc_mirror=No
         self.warc_download_prefix = warc_download_prefix
 
         if source == 'cc':
-            self.cc_mirror = cc_mirror or 'https://index.commoncrawl.org/'
-            self.raw_index_list = get_cc_endpoints(self.cc_mirror)
             if wb is not None:
                 raise ValueError('cannot specify wb= for source=cc')
+            self.cc_mirror = cc_mirror or 'https://index.commoncrawl.org/'
+            self.raw_index_list = get_cc_endpoints(self.cc_mirror)
             self.warc_download_prefix = warc_download_prefix or 'https://data.commoncrawl.org'
-            #https://commoncrawl.s3.amazonaws.com
         elif source == 'ia':
             self.index_list = ('https://web.archive.org/cdx/search/cdx',)
             if self.warc_download_prefix is None and self.wb is None:

@@ -230,8 +231,10 @@ def __init__(self, source='cc', wb=None, warc_download_prefix=None, cc_mirror=No
             LOGGER.setLevel(level=loglevel)
 
     def customize_index_list(self, params):
-        if self.source == 'cc' and ('from' in params or 'from_ts' in params or 'to' in params or 'closest' in params):
+        if self.source == 'cc' and (self.crawl or 'crawl' in params or 'from' in params or 'from_ts' in params or 'to' in params or 'closest' in params):
             LOGGER.info('making a custom cc index list')
+            if self.crawl and 'crawl' not in params:
+                params['crawl'] = self.crawl
             return filter_cc_endpoints(self.raw_index_list, self.cc_sort, params=params)
         else:
             return self.index_list

@@ -243,6 +246,8 @@ def get(self, url, **kwargs):
         validate_timestamps(params)
         params['url'] = url
         params['output'] = 'json'
+        if 'crawl' not in params:
+            params['crawl'] = self.crawl
         if 'filter' in params:
             if isinstance(params['filter'], str):
                 params['filter'] = (params['filter'],)

@@ -272,13 +277,15 @@ def iter(self, url, **kwargs):
         validate_timestamps(params)
         params['url'] = url
         params['output'] = 'json'
+        if 'crawl' not in params:
+            params['crawl'] = self.crawl
         if 'filter' in params:
             if isinstance(params['filter'], str):
                 params['filter'] = (params['filter'],)
             params['filter'] = munge_filter(params['filter'], self.source)
 
         if self.source == 'cc':
-            apply_cc_defaults(params)
+            apply_cc_defaults(params, crawl_present=bool(self.crawl))
 
         index_list = self.customize_index_list(params)
         return CDXFetcherIter(self, params=params, index_list=index_list)
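With these changes, a Python caller can pass `crawl=` directly to `CDXFetcher`, and `get()`/`iter()` copy it into `params['crawl']` when the caller hasn't supplied one. A minimal sketch of that usage, based on the constructor signature above (the printed fields are illustrative):

```
import cdx_toolkit

# crawl= accepts the same values as the new --crawl CLI switch,
# e.g. a single crawl name such as 'CC-MAIN-2024-33'
cdx = cdx_toolkit.CDXFetcher(source='cc', crawl='CC-MAIN-2024-33')

# iter() fills in params['crawl'] from self.crawl, and
# customize_index_list() then builds a per-crawl index list
for obj in cdx.iter('commoncrawl.org/*', limit=5):
    print(obj['status'], obj['url'])
```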

cdx_toolkit/cli.py

+6 -2

@@ -6,6 +6,7 @@
 import os
 
 import cdx_toolkit
+from cdx_toolkit.commoncrawl import normalize_crawl
 
 LOGGER = logging.getLogger(__name__)
 

@@ -17,13 +18,14 @@ def main(args=None):
     parser.add_argument('--verbose', '-v', action='count', help='set logging level to INFO (-v) or DEBUG (-vv)')
 
     parser.add_argument('--cc', action='store_const', const='cc', help='direct the query to the Common Crawl CDX/WARCs')
+    parser.add_argument('--crawl', action='store', help='crawl names (comma separated) or an integer for the most recent N crawls. Implies --cc')
     parser.add_argument('--ia', action='store_const', const='ia', help='direct the query to the Internet Archive CDX/wayback')
     parser.add_argument('--source', action='store', help='direct the query to this CDX server')
     parser.add_argument('--wb', action='store', help='direct replays for content to this wayback')
     parser.add_argument('--limit', type=int, action='store')
     parser.add_argument('--cc-mirror', action='store', help='use this Common Crawl index mirror')
     parser.add_argument('--cc-sort', action='store', help='default mixed, alternatively: ascending')
-    parser.add_argument('--from', action='store') # XXX default for cc
+    parser.add_argument('--from', action='store')
     parser.add_argument('--to', action='store')
     parser.add_argument('--filter', action='append', help='see CDX API documentation for usage')
     parser.add_argument('--get', action='store_true', help='use a single get instead of a paged iteration. default limit=1000')

@@ -93,13 +95,15 @@ def get_version():
 
 def setup(cmd):
     kwargs = {}
-    kwargs['source'] = cmd.cc or cmd.ia or cmd.source or None
+    kwargs['source'] = 'cc' if cmd.crawl else cmd.cc or cmd.ia or cmd.source or None
     if kwargs['source'] is None:
         raise ValueError('must specify --cc, --ia, or a --source')
     if cmd.wb:
         kwargs['wb'] = cmd.wb
     if cmd.cc_mirror:
         kwargs['cc_mirror'] = cmd.cc_mirror
+    if cmd.crawl:
+        kwargs['crawl'] = normalize_crawl([cmd.crawl]) # currently a string, not a list
     if getattr(cmd, 'warc_download_prefix', None) is not None:
         kwargs['warc_download_prefix'] = cmd.warc_download_prefix
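`normalize_crawl` is imported from `cdx_toolkit/commoncrawl.py`, which is not part of this diff. Going only by the `--crawl` help text (comma-separated crawl names, or an integer for the most recent N crawls), a hypothetical sketch of its contract might be:

```
# hypothetical sketch -- not the actual cdx_toolkit/commoncrawl.py code
# assumed contract: take the list of --crawl values, split any
# comma-separated names, and return a flat list of selectors
# (full names like 'CC-MAIN-2024-33', prefixes like 'CC-MAIN-2018',
# or a lone integer string meaning the N most recent crawls)
def normalize_crawl(crawls):
    out = []
    for crawl in crawls:
        out.extend(c.strip() for c in crawl.split(','))
    if len(out) > 1 and any(c.isdigit() for c in out):
        raise ValueError('mixing an integer with crawl names is ambiguous')
    return out
```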
