|
3 | 3 | [](https://github.com/cocrawler/cdx_toolkit/actions/workflows/ci.yaml) [](https://codecov.io/gh/cocrawler/cdx_toolkit) [](LICENSE)
|
4 | 4 |
|
5 | 5 | cdx_toolkit is a set of tools for working with CDX indices of web
|
6 |
| -crawls and archives, including those at CommonCrawl and the Internet |
7 |
| -Archive's Wayback Machine. |
| 6 | +crawls and archives, including those at the Common Crawl Foundation |
| 7 | +(CCF) and those at the Internet Archive's Wayback Machine. |
8 | 8 |
|
9 |
| -CommonCrawl uses Ilya Kreymer's pywb to serve the CDX API, which is |
10 |
| -somewhat different from the Internet Archive's CDX API server. cdx_toolkit |
11 |
| -hides these differences as best it can. cdx_toolkit also knits |
12 |
| -together the monthly Common Crawl CDX indices into a single, virtual |
13 |
| -index. |
| 9 | +Common Crawl uses Ilya Kreymer's pywb to serve the CDX API, which is |
| 10 | +somewhat different from the Internet Archive's CDX API server. |
| 11 | +cdx_toolkit hides these differences as best it can. cdx_toolkit also |
| 12 | +knits together the monthly Common Crawl CDX indices into a single, |
| 13 | +virtual index. |
14 | 14 |
|
15 | 15 | Finally, cdx_toolkit allows extracting archived pages from CC and IA
|
16 |
| -into WARC files. If you're looking to create subsets of CC or IA data |
17 |
| -and then process them into WET or WAT files, this is a feature you'll |
18 |
| -find useful. |
| 16 | +into WARC files. If you're looking to create subsets of CC or IA data |
| 17 | +and then further process them, this is a feature you'll find useful. |
19 | 18 |
|
20 | 19 | ## Installing
|
21 | 20 |
|
22 |
| -cdx toolkit requires Python 3. |
23 |
| - |
24 | 21 | ```
|
25 | 22 | $ pip install cdx_toolkit
|
26 | 23 | ```
|
27 | 24 |
|
28 |
| -or clone this repo and use `python ./setup.py install`. |
| 25 | +or clone this repo and use `pip install .` |
29 | 26 |
|
30 | 27 | ## Command-line tools
|
31 | 28 |
|
32 | 29 | ```
|
33 | 30 | $ cdxt --cc size 'commoncrawl.org/*'
|
34 |
| -$ cdxt --cc --limit 10 iter 'commoncrawl.org/*' |
| 31 | +$ cdxt --cc --limit 10 iter 'commoncrawl.org/*' # returns the most recent year |
| 32 | +$ cdxt --crawl 3 --limit 10 iter 'commoncrawl.org/*' # returns the most recent 3 crawls |
35 | 33 | $ cdxt --cc --limit 10 --filter '=status:200' iter 'commoncrawl.org/*'
|
36 |
| -$ cdxt --ia --limit 10 iter 'commoncrawl.org/*' |
| 34 | +
|
| 35 | +$ cdxt --ia --limit 10 iter 'commoncrawl.org/*' # will show the beginning of IA's crawl |
37 | 36 | $ cdxt --ia --limit 10 warc 'commoncrawl.org/*'
|
38 | 37 | ```
|
39 | 38 |
|
40 | 39 | cdxt takes a large number of command line switches, controlling
|
41 | 40 | the time period and all other CDX query options. cdxt can generate
|
42 | 41 | WARC, jsonl, and csv outputs.
|
43 | 42 |
|
44 |
| -** Note that by default, cdxt --cc will iterate over the previous |
45 |
| -year of captures. ** |
| 43 | +If you don't specify much about the crawls or dates or number of |
| 44 | +records you're interested in, some default limits will kick in to |
| 45 | +prevent overly-large queries. These default limits include a maximum |
| 46 | +of 1000 records (`--limit 1000`) and a limit of 1 year of CC indexes. |
| 47 | +To exceed these limits, use `--limit` and `--crawl` or `--from` and |
| 48 | +`--to`. |
| 49 | + |
| 50 | +If it seems like nothing is happening, add `-v` or `-vv` at the start: |
| 51 | + |
| 52 | +``` |
| 53 | +$ cdxt -vv --cc size 'commoncrawl.org/*' |
| 54 | +``` |
| 55 | + |
| 56 | +## Selecting particular CCF crawls |
| 57 | + |
| 58 | +Common Crawl's data is divided into "crawls", which were yearly at the |
| 59 | +start, and are currently done monthly. There are over 100 of them. |
| 60 | +[You can find details about these crawls here.](https://data.commoncrawl.org/crawl-data/index.html) |
| 61 | + |
| 62 | +Unlike some web archives, CCF doesn't have a single CDX index that |
| 63 | +covers all of these crawls -- we have 1 index per crawl. The way |
| 64 | +you ask for a particular crawl is: |
| 65 | + |
| 66 | +``` |
| 67 | +$ cdxt --crawl CC-MAIN-2024-33 iter 'commoncrawl.org/*' |
| 68 | +``` |
| 69 | + |
| 70 | +- `--crawl CC-MAIN-2024-33` is a single crawl. |
| 71 | +- `--crawl 3` is the latest 3 crawls. |
| 72 | +- `--crawl CC-MAIN-2018` will match all of the crawls from 2018. |
| 73 | +- `--crawl CC-MAIN-2018,CC-MAIN-2019` will match all of the crawls from 2018 and 2019. |
| 74 | + |
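The four `--crawl` forms listed above can be read as two rules: a bare integer takes the N most recent crawls, and anything else is a comma-separated list of name prefixes. The sketch below only illustrates that reading and is not cdx_toolkit's actual implementation. The function name `select_crawls` and the sample crawl list are hypothetical, and the list is assumed to be ordered newest first.

```python
def select_crawls(spec: str, crawls: list[str]) -> list[str]:
    """Pick crawl names matching a --crawl style specifier.

    `crawls` is assumed to be ordered newest first.
    """
    if spec.isdigit():
        # A bare integer means "the latest N crawls".
        return crawls[: int(spec)]
    # Otherwise treat the spec as comma-separated name prefixes.
    prefixes = [p for p in spec.split(",") if p]
    return [c for c in crawls if any(c.startswith(p) for p in prefixes)]


# Illustrative sample only, newest first.
SAMPLE_CRAWLS = [
    "CC-MAIN-2024-33",
    "CC-MAIN-2024-30",
    "CC-MAIN-2019-04",
    "CC-MAIN-2018-51",
    "CC-MAIN-2018-47",
]

print(select_crawls("CC-MAIN-2024-33", SAMPLE_CRAWLS))
print(select_crawls("3", SAMPLE_CRAWLS))
print(select_crawls("CC-MAIN-2018,CC-MAIN-2019", SAMPLE_CRAWLS))
```

Prefix matching is what lets `CC-MAIN-2018` stand for every crawl from that year while a full name such as `CC-MAIN-2024-33` still selects exactly one crawl.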
| 75 | +CCF also has a hive-sharded parquet index (called the columnar index) |
| 76 | +that covers all of our crawls. Querying broad time ranges is much |
| 77 | +faster with the columnar index. You can find more information about |
| 78 | +this index at [the blog post about it](https://commoncrawl.org/blog/index-to-warc-files-and-urls-in-columnar-format). |
| 79 | + |
| 80 | +The Internet Archive cdx index is organized as a single crawl that goes |
| 81 | +from the very beginning until now. That's why there is no `--crawl` for |
| 82 | +`--ia`. Note that cdx queries to `--ia` will default to one year |
| 83 | +and a limit of 1000 entries if you do not specify `--from`, `--to`, and `--limit`. |
| 84 | + |
| 85 | +## Selecting by time |
| 86 | + |
| 87 | +In most cases you'll probably use `--crawl` to select the time range for |
| 88 | +Common Crawl queries, but for the Internet Archive you'll need to specify |
| 89 | +a time range like this: |
| 90 | + |
| 91 | +``` |
| 92 | +$ cdxt --ia --limit 1 --from 2008 --to 200906302359 iter 'commoncrawl.org/*' |
| 93 | +``` |
| 94 | + |
| 95 | +In this example the time range starts at the beginning of 2008 and |
| 96 | +ends on June 30, 2009 at 23:59. All times are in UTC. If you do not |
| 97 | +specify a time range (and also don't use `--crawl`), you'll get the |
| 98 | +most recent year. |
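To make the time-range example concrete: CDX timestamps are 14-digit `YYYYMMDDhhmmss` strings, and a shorter value such as `2008` or `200906302359` can be read as a prefix of one. The sketch below shows one way such prefixes could be expanded so that `--from` covers the earliest matching instant and `--to` the latest, which matches the description above. It is an assumption about the semantics, not cdx_toolkit's own code, and the helper names `pad_timestamp_start` and `pad_timestamp_end` are hypothetical.

```python
import calendar

VALID_LENGTHS = (4, 6, 8, 10, 12, 14)  # YYYY, YYYYMM, ..., YYYYMMDDhhmmss


def _check(ts: str) -> None:
    if not ts.isdigit() or len(ts) not in VALID_LENGTHS:
        raise ValueError(f"not a partial CDX timestamp: {ts!r}")


def pad_timestamp_start(ts: str) -> str:
    """Expand a partial timestamp to the earliest instant it covers."""
    _check(ts)
    earliest = "00000101000000"
    return ts + earliest[len(ts):]


def pad_timestamp_end(ts: str) -> str:
    """Expand a partial timestamp to the latest instant it covers."""
    _check(ts)
    latest = "99991231235959"
    padded = ts + latest[len(ts):]
    if len(ts) < 8:
        # No day was given, so replace the placeholder 31 with the
        # real last day of that month.
        year, month = int(padded[:4]), int(padded[4:6])
        last_day = calendar.monthrange(year, month)[1]
        padded = f"{padded[:6]}{last_day:02d}{padded[8:]}"
    return padded


print(pad_timestamp_start("2008"))        # value passed to --from above
print(pad_timestamp_end("200906302359"))  # value passed to --to above
```

Padding the two ends differently is the design choice that matters here: a bare year given as a start should include January 1, while the same year given as an end should include December 31.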
46 | 99 |
|
47 |
| -See |
| 100 | +## The full syntax for command-line tools |
48 | 101 |
|
49 | 102 | ```
|
50 | 103 | $ cdxt --help
|
51 | 104 | $ cdxt iter --help
|
52 | 105 | $ cdxt warc --help
|
| 106 | +$ cdxt size --help |
53 | 107 | ```
|
54 | 108 |
|
55 | 109 | for full details. Note that argument order really matters; each switch
|
56 | 110 | is valid only either before or after the {iter,warc,size} command.
|
57 | 111 |
|
58 | 112 | Add -v (or -vv) to see what's going on under the hood.
|
59 | 113 |
|
60 |
| -## Programming example |
| 114 | +## Python programming example |
| 115 | + |
| 116 | +Everything that you can do on the command line, and much more, can |
| 117 | +be done by writing a Python program. |
61 | 118 |
|
62 | 119 | ```
|
63 | 120 | import cdx_toolkit
|
@@ -231,5 +288,3 @@ distributed under the License is distributed on an "AS IS" BASIS,
|
231 | 288 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
232 | 289 | See the License for the specific language governing permissions and
|
233 | 290 | limitations under the License.
|
234 |
| - |
235 |
| - |
|