|
32 | 32 | - **JavaScript** parsing / crawling |
33 | 33 | - Customizable **automatic form filling** |
34 | 34 | - **Scope control** - Preconfigured field / Regex |
| 35 | + - **Knowledge base** - ML page-type / form classification (auto-downloaded model) |
35 | 36 | - **Customizable output** - Preconfigured fields |
36 | 37 | - INPUT - **STDIN**, **URL** and **LIST** |
37 | 38 | - OUTPUT - **STDOUT**, **FILE** and **JSON** |
38 | 39 |
|
39 | 40 | ## Installation |
40 | 41 |
|
41 | | -katana requires Go 1.25+ to install successfully. If you encounter any installation issues, we recommend trying with the latest available version of Go, as the minimum required version may have changed. Run the command below or download a pre-compiled binary from the [release page](https://github.com/projectdiscovery/katana/releases). |
| 42 | +katana requires Go 1.26+ to install successfully. If you encounter any installation issues, we recommend trying with the latest available version of Go, as the minimum required version may have changed. Run the command below or download a pre-compiled binary from the [release page](https://github.com/projectdiscovery/katana/releases). |
42 | 43 |
|
43 | 44 | ```console |
44 | 45 | CGO_ENABLED=1 go install github.com/projectdiscovery/katana/cmd/katana@latest |
@@ -620,6 +621,62 @@ Option to limit the number of pages crawled per domain. Prevents any single doma |
620 | 621 | katana -u https://tesla.com -mdp 100 |
621 | 622 | ``` |
622 | 623 |
|
| 624 | +## Knowledge Base Classification |
| 625 | +
|
| 626 | +Katana can enrich crawl results with a **knowledge base** — machine-learning classification of each crawled page powered by [dit](https://github.com/HappyHackingSpace/dit). When enabled, every response is classified by **page type** (e.g. `login`, `error`, `captcha`, `parked`) and any forms on the page are identified, with the result attached to the `knowledgebase` field of the JSONL output. This works across **all engines** (standard and headless). |
| 627 | +
|
| 628 | +> **Note**: The classification model is **downloaded automatically** on first use to `~/.dit/model.json` (from [Hugging Face](https://huggingface.co/datasets/happyhackingspace/dit)). This is a one-time, per-machine cost — subsequent runs reuse the cached model. No manual installation of `dit` is required. |
| 629 | +
|
| 630 | +*`-knowledge-base`* |
| 631 | +---- |
| 632 | +
|
| 633 | +Enable knowledge base classification. The page type and any form classifications are added to the `knowledgebase` field of each result. The example below uses the short form `-kb`. |
| 634 | +
|
| 635 | +```console |
| 636 | +katana -u https://example.com -kb -jsonl |
| 637 | +``` |
| 638 | + |
| 639 | +```json |
| 640 | +{ |
| 641 | + "timestamp": "...", |
| 642 | + "request": { "...": "..." }, |
| 643 | + "response": { |
| 644 | + "...": "...", |
| 645 | + "knowledgebase": { |
| 646 | + "PageType": "login", |
| 647 | + "Forms": [{ "type": "login", "fields": { "username": "username or email", "password": "password" } }] |
| 648 | + } |
| 649 | + } |
| 650 | +} |
| 651 | +``` |
| 652 | + |
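The JSONL output can be post-processed outside katana. The following is a minimal sketch, assuming each result has the shape shown in the example above (`response.knowledgebase.PageType` and `Forms`). The helper name `filter_by_page_type` and the sample records are hypothetical and are not part of katana.

```python
import json
from typing import Iterable, Iterator


def filter_by_page_type(lines: Iterable[str], wanted: set[str]) -> Iterator[dict]:
    """Yield parsed JSONL results whose knowledgebase PageType is in `wanted`.

    Assumes the field layout from the README example; results without a
    `knowledgebase` object are skipped rather than treated as errors.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            result = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not valid JSON
        if not isinstance(result, dict):
            continue
        response = result.get("response")
        if not isinstance(response, dict):
            continue
        kb = response.get("knowledgebase")
        if isinstance(kb, dict) and kb.get("PageType") in wanted:
            yield result


if __name__ == "__main__":
    # Hypothetical sample records standing in for katana's JSONL output.
    sample = [
        '{"response": {"knowledgebase": {"PageType": "login", "Forms": [{"type": "login"}]}}}',
        '{"response": {"knowledgebase": {"PageType": "error"}}}',
        '{"response": {}}',
    ]
    for item in filter_by_page_type(sample, {"login"}):
        print(json.dumps(item["response"]["knowledgebase"]))
```

To use it on real output, pass an open file or `sys.stdin` as `lines` in place of the sample list.
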
| 653 | +*`-filter-page-type`* |
| 654 | +---- |
| 655 | + |
| 656 | +Filter results to only the given page type(s). Enabling this implies `-kb` (the classifier is initialized automatically). |
| 657 | + |
| 658 | +```console |
| 659 | +katana -u https://example.com -fpt login,error |
| 660 | +``` |
| 661 | + |
| 662 | +*`-kb-secrets`* |
| 663 | +---- |
| 664 | + |
| 665 | +Enable the secrets extractor in the knowledge base, surfacing detected secrets (API keys, tokens, etc.) under the `secrets` key. Add `-kb-validate-secrets` to validate detected secrets against their provider — note this **sends live API calls**. |
| 666 | + |
| 667 | +```console |
| 668 | +katana -u https://example.com -kb-secrets |
| 669 | +``` |
| 670 | + |
| 671 | +*`-kb-endpoints`* |
| 672 | +---- |
| 673 | + |
| 674 | +Enable the endpoints extractor, which classifies requests as REST, GraphQL, SOAP, or XHR under the `endpoints` key. |
| 675 | + |
| 676 | +```console |
| 677 | +katana -u https://example.com -kb-endpoints |
| 678 | +``` |
| 679 | + |
623 | 680 | ## Authenticated Crawling |
624 | 681 |
|
625 | 682 | Authenticated crawling involves including custom headers or cookies in HTTP requests to access protected resources. These headers provide authentication or authorization information, allowing you to crawl authenticated content and endpoints. You can specify headers directly on the command line or provide them to katana as a file to perform authenticated crawling. |
|