SiteOne Crawler: Text Output Documentation

1. Introduction

The text output begins with an ASCII art logo, version information, and the author's contact details. This is followed by several sections detailing various aspects of the crawled website. The primary sections include:

Progress Report: Real-time status of crawled URLs.
Skipped URLs Summary: Aggregated counts of URLs skipped for various reasons.
Skipped URLs: Detailed list of skipped URLs, reasons, and sources.
Redirected URLs: List of URLs that resulted in redirects.
404 URLs: List of URLs that returned a 404 Not Found status.
SSL/TLS Info: Details about the website's SSL/TLS certificate.
Performance Metrics: Top fastest and slowest URLs.
SEO & Content Analysis: SEO metadata, OpenGraph metadata, heading structure, and duplicate content reports.
HTTP Headers: Analysis of HTTP headers found during the crawl.
HTTP Caching: Detailed breakdown of caching strategies by content type and domain.
Best Practices: Results of various best practice checks.
Accessibility: Results of accessibility checks.
Source Domains: Summary of crawled domains.
Content Types: Summary of crawled content types (general and MIME types).
DNS Info: Information about DNS resolution.
Security: Results of security header checks.
Analysis Stats: Performance statistics for the crawler's internal analyzers.

2. General Format

The output uses simple text formatting:

Headings: Section titles are typically preceded by a line of hyphens (---) and followed by a line of equals signs (===) or hyphens (---) for visual separation.
Tables: Data is presented in fixed-width tables with headers underlined by hyphens (-). Column alignment is maintained using spaces. This documentation uses Markdown tables for examples.
Truncation: Some tables containing potentially large amounts of data (like SEO metadata or heading structures) might show only a limited number of rows (e.g., max 10) in the text output, with a note advising the use of the HTML report (--output-html-report) for the complete data.
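As a rough illustration of the table layout described above, the following Python sketch slices a fixed-width table into records. It assumes that columns are separated by at least two spaces and that the underline consists solely of hyphens; the real output may use a different column separator, so `parse_fixed_width_table` is a hypothetical helper and a starting point rather than a reference parser.

```python
import re


def parse_fixed_width_table(lines):
    """Turn [header, hyphen underline, row, ...] into a list of dicts.

    Assumption: columns are separated by two or more spaces and every
    row is aligned to the column offsets of the header line.
    """
    header, underline, *rows = lines
    if not underline.strip() or set(underline.strip()) != {"-"}:
        raise ValueError("expected a line of hyphens below the header")
    # A column starts at offset 0 and after every run of 2+ spaces.
    starts = [0] + [m.end() for m in re.finditer(r" {2,}(?=\S)", header)]
    ends = starts[1:] + [None]
    names = [header[s:e].strip() for s, e in zip(starts, ends)]
    return [
        {name: row[s:e].strip() for name, s, e in zip(names, starts, ends)}
        for row in rows
        if row.strip()
    ]


# Hypothetical sample, built with ljust() so the offsets line up exactly.
sample = [
    "Reason".ljust(18) + "Domain".ljust(20) + "Unique URLs",
    "-" * 49,
    "Not allowed host".ljust(18) + "nextjs.org".ljust(20) + "1294",
    "Robots.txt".ljust(18) + "crawler.siteone.io".ljust(20) + "3",
]
for record in parse_fixed_width_table(sample):
    print(record)
```

Slicing by header offsets, rather than splitting each row on whitespace, keeps multi-word cells such as "Not allowed host" intact and tolerates empty cells.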

3. Detailed Section Breakdown

3.1. Progress Report

This section shows the progress of the crawl in real-time (or the final state if the crawl is complete).

Progress	%	Bar	URL	Status	Type	Time	Size	Cache	Access.	Best pr.
1/43	2%		/	200	HTML	34 ms	45 kB	60 min	1/1/2	1/5
2/64	3%		/installation-and-requirements/ready-to-use-packages/	200	HTML	13 ms	61 kB	60 min	1/3	1/5
...	...	...	...	...	...	...	...	...	...	...

Columns include:
- Progress (X/Y): X = URL sequence number, Y = Total URLs found so far.
- %: Percentage of URLs processed relative to the total found.
- Bar: Visual indicator (>, >>, etc.).
- URL: The path or full URL being processed.
- Status: HTTP status code returned (e.g., 200, 404, 301).
- Type: Detected content type (e.g., HTML, JS, CSS, Image).
- Time: Time taken to download the URL.
- Size: Size of the downloaded content.
- Cache: Detected cache lifetime (e.g., 60 min, 12 mon, none).
- Access.: Accessibility issues summary (OK/Notice/Warning/Critical).
- Best pr.: Best practices issues summary (OK/Notice/Warning/Critical).
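The relationship between the Progress and % columns can be sketched in a few lines of Python. The helper name `progress_percent` is hypothetical, and truncating to a whole number is an assumption: it is consistent with the two example rows, but the crawler's actual rounding rule is not documented here.

```python
def progress_percent(progress):
    """Return the whole-number percentage for a 'X/Y' Progress cell."""
    done, found = (int(part) for part in progress.split("/"))
    if found <= 0:
        raise ValueError("total URLs found must be positive")
    # Assumption: the percentage is truncated, not rounded.
    return 100 * done // found


for cell in ("1/43", "2/64"):
    print(cell, f"{progress_percent(cell)}%")
# 1/43 2%
# 2/64 3%
```

Because Y is the number of URLs found so far, it can grow between rows (43, then 64 in the example), so the percentage is relative to a moving total.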

3.2. Skipped URLs Summary

Provides a high-level overview of why URLs were skipped during the crawl, grouped by reason and domain.

Skipped URLs Summary

Reason	Domain	Unique URLs
Not allowed host	nextjs.org	1294
Not allowed host	astro.build	925
Robots.txt	crawler.siteone.io	3
...	...	...

Reason: Why the URL was skipped (e.g., Not allowed host, Robots.txt, Max depth reached).
Domain: The domain of the skipped URLs.
Unique URLs: The count of unique URLs skipped for that reason/domain combination.
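Once the rows are available as tuples, the summary can be rolled up further, for example per reason across all domains. This is a minimal sketch using only the three rows visible in the example; `totals_by_reason` is a hypothetical helper name.

```python
from collections import Counter

# Rows copied from the example table above: (reason, domain, unique URLs).
rows = [
    ("Not allowed host", "nextjs.org", 1294),
    ("Not allowed host", "astro.build", 925),
    ("Robots.txt", "crawler.siteone.io", 3),
]


def totals_by_reason(rows):
    """Sum the 'Unique URLs' column for each skip reason."""
    totals = Counter()
    for reason, _domain, unique_urls in rows:
        totals[reason] += unique_urls
    return totals


for reason, count in totals_by_reason(rows).most_common():
    print(f"{reason}: {count}")
# Not allowed host: 2219
# Robots.txt: 3
```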

3.3. Skipped URLs

Lists individual skipped URLs with more context.

Skipped URLs

Reason	Skipped URL	Source	Found at URL
Not allowed host	http://astro.build/	`<a href>`	/html/2024-08-24/forever/hwzxj1-qrs69-1fqlxbd.html
Not allowed host	https://adamwathan.me/	`<a href>`	/introduction/thanks/
...	...	...	...

Reason: Why the URL was skipped.
Skipped URL: The specific URL that was not crawled.
Source: How the URL was discovered (e.g., <a href>, <img src>, CSS url()).
Found at URL: The URL where the skipped URL was found.
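The detailed list lends itself to building an inventory of external hosts and the pages that reference them. The sketch below uses the two example rows and Python's standard `urllib.parse`; the helper name `external_hosts` is hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Rows copied from the example table above:
# (reason, skipped URL, source, found at URL).
skipped = [
    ("Not allowed host", "http://astro.build/", "<a href>",
     "/html/2024-08-24/forever/hwzxj1-qrs69-1fqlxbd.html"),
    ("Not allowed host", "https://adamwathan.me/", "<a href>",
     "/introduction/thanks/"),
]


def external_hosts(rows):
    """Map each skipped host to the set of pages that reference it."""
    hosts = defaultdict(set)
    for _reason, url, _source, found_at in rows:
        host = urlsplit(url).hostname
        if host:  # relative or malformed URLs have no hostname
            hosts[host].add(found_at)
    return dict(hosts)


for host, pages in sorted(external_hosts(skipped).items()):
    print(host, sorted(pages))
```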

3.4. Redirected URLs

Lists URLs that resulted in an HTTP redirect. The example crawl produced no redirects, so no sample is shown; when redirects are found, they are listed in a table similar to those in the other sections.

3.5. 404 URLs

Lists URLs that returned a 404 Not Found status code.

404 URLs

Status	URL 404	Found at URL
404	https://crawler.siteone.io/html/2024-08-23/forever/httpAgentOptions	https://crawler.siteone.io/html/2024-08-23/forever/cl8xw4r-fdag8wg-44dd.html
...	...	...

Status: The HTTP status code (typically 404).
URL 404: The URL that resulted in the 404 error.
Found at URL: The URL containing the link to the broken page.

3.6. SSL/TLS Info

Provides details about the SSL/TLS certificate of the primary host.

SSL/TLS info

Info	Text
Issuer	C = BE, O = GlobalSign nv-sa, CN = GlobalSign GCC R6 AlphaSSL CA 2023
Subject	CN = *.siteone.io
Valid from	Jan 23 09:52:19 2025 GMT (VALID already 73.2 day(s))
Valid to	Feb 24 09:52:18 2026 GMT (VALID still for 323.8 day(s))
Supported protocols	TLSv1.2
RAW certificate output	`Certificate: ...` (details omitted)
RAW protocols output	`=== ssl2 === ...` (details omitted)

Info: The type of information (Issuer, Subject, Validity dates, Supported protocols).
Text: The corresponding value for the information type. Includes raw output snippets.
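The validity dates can be turned into real timestamps, for instance to alert on certificates that are close to expiry. The sketch below assumes the date format shown in the example ("Mon DD HH:MM:SS YYYY GMT") and that a parenthesised annotation may follow it; `parse_cert_date` and `days_between` are hypothetical helper names.

```python
from datetime import datetime, timezone

# %b expects English month abbreviations (the default C locale).
DATE_FORMAT = "%b %d %H:%M:%S %Y GMT"


def parse_cert_date(text):
    """Parse e.g. 'Feb 24 09:52:18 2026 GMT (...)' into an aware UTC datetime."""
    stamp = text.split("(")[0].strip()  # drop the '(VALID ...)' annotation
    return datetime.strptime(stamp, DATE_FORMAT).replace(tzinfo=timezone.utc)


def days_between(start, end):
    """Return the (possibly fractional) number of days from start to end."""
    return (end - start).total_seconds() / 86400


valid_from = parse_cert_date("Jan 23 09:52:19 2025 GMT (VALID already 73.2 day(s))")
valid_to = parse_cert_date("Feb 24 09:52:18 2026 GMT (VALID still for 323.8 day(s))")

print(f"certificate lifetime: {days_between(valid_from, valid_to):.1f} days")
print(f"days left as of now: {days_between(datetime.now(timezone.utc), valid_to):.1f}")
```

The two annotations in the example (73.2 and 323.8 days) add up to the full certificate lifetime of roughly 397 days, which is a quick way to sanity-check parsed values.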

3.7. Performance Metrics (Fastest/Slowest URLs)

Two tables listing the top N fastest and slowest URLs encountered during the crawl.

TOP fastest URLs

Time	Status	Fast URL
11 ms	200	https://crawler.siteone.io/installation-and-requirements/desktop-application/
...	...	...

TOP slowest URLs

Time	Status	Slow URL
2.7 s	200	https://crawler.siteone.io/html/2024-08-24/forever/hwzxj1-qrs69-1fqlxbd.html
...	...	...

Time: Time taken to download the URL.
Status: HTTP status code.
Fast/Slow URL: The URL itself.
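Because the Time column mixes units (ms and s in the examples), values need normalising before they can be compared against a threshold. A minimal sketch follows; the helper names are hypothetical, and only the two units seen in the examples are handled.

```python
import re

# Factors for the unit suffixes seen in the examples above.
# Assumption: other units may exist in real output and are rejected here.
_FACTORS_MS = {"ms": 1.0, "s": 1000.0}


def to_milliseconds(cell):
    """Convert a Time cell such as '11 ms' or '2.7 s' to milliseconds."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(ms|s)\s*", cell)
    if match is None:
        raise ValueError(f"unrecognized duration: {cell!r}")
    value, unit = match.groups()
    return float(value) * _FACTORS_MS[unit]


def slower_than(rows, threshold_ms):
    """Return the URLs whose download time exceeds threshold_ms."""
    return [url for time_cell, _status, url in rows
            if to_milliseconds(time_cell) > threshold_ms]


# Rows copied from the two example tables above: (time, status, URL).
timings = [
    ("11 ms", 200,
     "https://crawler.siteone.io/installation-and-requirements/desktop-application/"),
    ("2.7 s", 200,
     "https://crawler.siteone.io/html/2024-08-24/forever/hwzxj1-qrs69-1fqlxbd.html"),
]
print(slower_than(timings, threshold_ms=1000))
```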

3.8. SEO & Content Analysis

Includes several sub-sections, such as SEO metadata, OpenGraph metadata, heading structure, and duplicate content reports. These tables are often truncated in the text output, so no examples are reproduced here.

3.9. HTTP Headers

Analyzes HTTP response headers across all crawled URLs.

HTTP headers (Summary): Lists unique headers, occurrence count, unique value count, preview of values, and min/max values where applicable (e.g., for Content-Length or dates).
HTTP header values (Detailed): Lists specific values for headers with multiple distinct values, along with their occurrence counts.

HTTP headers (Summary Example)

Header	Occurs	Unique	Values preview	Min value	Max value
Accept-Ranges	12	1	bytes
Cache-Control	66	2	max-age=3600 (49) / max-age=31536000 (17)
...	...	...	...	...	...

HTTP header values (Detailed Example)

Header	Occurs	Value
Accept-Ranges	12	bytes
Cache-Control	49	max-age=3600
...	...	...
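The "Values preview" cell packs several values and their counts into one string. The following sketch splits it back apart, assuming the "value (count) / value (count)" pattern seen in the example and that header values themselves do not contain " / "; `parse_values_preview` is a hypothetical helper.

```python
import re


def parse_values_preview(preview):
    """Split 'a (49) / b (17)' into {'a': 49, 'b': 17}.

    A value without a trailing '(count)' maps to None.
    """
    counts = {}
    for part in preview.split(" / "):
        part = part.strip()
        match = re.fullmatch(r"(.*\S)\s*\((\d+)\)", part)
        if match:
            counts[match.group(1)] = int(match.group(2))
        else:
            counts[part] = None
    return counts


cache_control = parse_values_preview("max-age=3600 (49) / max-age=31536000 (17)")
print(cache_control)
print(sum(cache_control.values()))  # 66, matching the Occurs column
```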

3.10. HTTP Caching

Provides detailed analysis of HTTP caching headers.

HTTP Caching by content type: Summarizes caching strategies (e.g., Cache-Control + ETag, No cache headers) used for different content types (HTML, CSS, JS, Image, etc.), including counts and average/min/max lifetimes.
HTTP Caching by domain: Similar summary, but grouped by domain.
HTTP Caching by domain and content type: The most granular view, showing caching strategies for each content type within each domain.

HTTP Caching by content type (Example)

Content type	Cache type	URLs	AVG lifetime	MIN lifetime	MAX lifetime
HTML	Cache-Control + ETag + Last-Modified	45	60 min	60 min	60 min
Image	Cache-Control + ETag + Last-Modified	11	12 mon	12 mon	12 mon
...	...	...	...	...	...
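To relate the lifetime columns to raw header values, it can help to convert a `Cache-Control` max-age into human units. The sketch below assumes the lifetimes are derived from directives like those in the HTTP headers example (max-age=3600 and max-age=31536000); how the crawler rounds a value into a label such as "12 mon" is not specified here, so only exact minute and day conversions are shown.

```python
import re


def max_age_seconds(cache_control):
    """Return the max-age directive in seconds, or None if it is absent."""
    match = re.search(r"(?:^|[,\s])max-age=(\d+)", cache_control, re.IGNORECASE)
    return int(match.group(1)) if match else None


for value in ("max-age=3600", "max-age=31536000"):
    seconds = max_age_seconds(value)
    print(f"{value}: {seconds // 60} min, {seconds // 86400} days")
# max-age=3600: 60 min, 0 days
# max-age=31536000: 525600 min, 365 days
```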

3.11. Best Practices

Summarizes results from various best practice checks.

Best practices (Example)

Analysis name	OK	Notice	Warning	Critical
Large inline SVGs (> 5120 B)	148	0	108	0
Invalid inline SVGs	63	0	193	0
Heading structure	0	47	0	0
...	...	...	...	...

Analysis name: The specific check performed.
OK / Notice / Warning / Critical: Counts of URLs falling into each severity category for that check.
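The severity counts make a convenient gate for automated checks. As a minimal sketch using the example rows (the helper name and the thresholds are hypothetical), the following flags every check whose Warning or Critical count exceeds an allowed limit:

```python
# Rows copied from the example table above:
# (analysis name, OK, Notice, Warning, Critical).
checks = [
    ("Large inline SVGs (> 5120 B)", 148, 0, 108, 0),
    ("Invalid inline SVGs", 63, 0, 193, 0),
    ("Heading structure", 0, 47, 0, 0),
]


def failing_checks(rows, max_warning=0, max_critical=0):
    """Return names of checks whose Warning/Critical counts exceed the limits."""
    return [
        name
        for name, _ok, _notice, warning, critical in rows
        if warning > max_warning or critical > max_critical
    ]


print(failing_checks(checks))
# ['Large inline SVGs (> 5120 B)', 'Invalid inline SVGs']
```

The Accessibility table in section 3.12 has the same column shape, so the same function applies to its rows.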

3.12. Accessibility

Summarizes results from accessibility checks.

Accessibility (Example)

Analysis name	OK	Notice	Warning	Critical
Missing image alt attributes	1419	0	2	0
Missing html lang attribute	0	0	0	1
...	...	...	...	...

Analysis name: The specific accessibility check.
OK / Notice / Warning / Critical: Counts for each severity level.

3.13. Source Domains

Lists all domains from which resources were successfully crawled, with counts and size/time summaries per content type.

Source domains (Example)

Domain	Totals	HTML	Image	JS	CSS	Document	JSON
crawler.siteone.io	67/30MB/6.2s	48/12MB/4s	11/18MB/2s	4/13kB/54ms	2/77kB/41ms	1/135B/10ms	1/36B/14ms
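Each cell in this table appears to pack three values in the form count/size/time. The sketch below splits a cell on that assumption and checks that the per-type counts add up to the Totals count; sizes and times are kept as strings because the unit conventions (for example, whether kB means 1000 or 1024 bytes) are not stated here.

```python
def parse_domain_cell(cell):
    """Split a 'count/size/time' cell such as '48/12MB/4s'."""
    count, size, time = cell.split("/")
    return int(count), size, time


# Values copied from the example row above.
totals = "67/30MB/6.2s"
per_type = {
    "HTML": "48/12MB/4s",
    "Image": "11/18MB/2s",
    "JS": "4/13kB/54ms",
    "CSS": "2/77kB/41ms",
    "Document": "1/135B/10ms",
    "JSON": "1/36B/14ms",
}

total_count = parse_domain_cell(totals)[0]
summed_count = sum(parse_domain_cell(cell)[0] for cell in per_type.values())
print(total_count, summed_count)  # 67 67
```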

3.14. Content Types

Summarizes crawled resources by content type.

Content types (General): Groups by broad categories (HTML, Image, JS, CSS, etc.).
Content types (MIME types): Groups by specific MIME types (e.g., text/html, image/jpeg, application/javascript).

Content types (General Example)

Content type	URLs	Total size	Total time	Avg time	Status 20x	Status 40x
HTML	48	12 MB	4 s	84 ms	48	0
Image	11	18 MB	2 s	185 ms	11	0
...	...	...	...	...	...	...

Content types (MIME types Example)

Content type	URLs	Total size	Total time	Avg time	Status 20x	Status 40x
text/html	45	2 MB	1.2 s	26 ms	45	0
application/javascript	4	13 kB	54 ms	14 ms	4	0
...	...	...	...	...	...	...

3.15. DNS Info

Shows the DNS resolution tree for the crawled domain(s) and the DNS server used. (Not a table format).

DNS info
--------

DNS resolving tree                                                    
------------------------------------------------------------------------
crawler.siteone.io                                                    
  IPv4: 86.49.167.242                                                 
                                                                      
DNS server: 10.255.255.254
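Since this section is free-form rather than tabular, a small line-oriented parser is one way to extract it. The sketch below handles the layout shown in the example; the `IPv6:` prefix is an assumption (only `IPv4:` appears above), and `parse_dns_tree` is a hypothetical helper.

```python
import ipaddress


def parse_dns_tree(lines):
    """Collect host -> IP addresses and the DNS server from the tree lines."""
    hosts, server, current = {}, None, None
    for raw in lines:
        stripped = raw.strip()
        # Skip blank lines, hyphen rules, and the tree heading.
        if not stripped or set(stripped) == {"-"} or stripped == "DNS resolving tree":
            continue
        if stripped.startswith("DNS server:"):
            server = stripped.split(":", 1)[1].strip()
        elif stripped.startswith(("IPv4:", "IPv6:")):  # 'IPv6:' is assumed
            address = stripped.split(":", 1)[1].strip()
            ipaddress.ip_address(address)  # raises ValueError if malformed
            hosts.setdefault(current, []).append(address)
        else:
            current = stripped  # a host name opens a new branch
            hosts.setdefault(current, [])
    return hosts, server


dns_lines = [
    "DNS resolving tree",
    "------------------",
    "crawler.siteone.io",
    "  IPv4: 86.49.167.242",
    "",
    "DNS server: 10.255.255.254",
]
print(parse_dns_tree(dns_lines))
```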

3.16. Security

Reports on the presence and configuration of important security-related HTTP headers.

Security (Example)

Header	OK	Notice	Warning	Critical	Recommendation
Strict-Transport-Security	45	0	0	3	Strict-Transport-Security header is not set. It enforces secure connections and protects against MITM attacks.
X-XSS-Protection	45	0	0	3	X-XSS-Protection header is not set. It enables browser's built-in defenses against XSS attacks.
...	...	...	...	...	...

Header: The security header being checked.
OK / Notice / Warning / Critical: Counts based on the header's presence and configuration.
Recommendation: Suggestion for improvement if issues are found.

3.17. Analysis Stats

Provides performance metrics for the crawler's internal analysis modules. Useful for debugging the crawler itself.

Analysis stats (Example)

Class::method	Exec time	Exec count
Manager::parseDOMDocument	707 ms	48
SslTlsAnalyzer::getTLSandSSLCertificateInfo	215 ms	1
...	...	...

4. Information Obtainable from Text Output

The text output provides a wealth of information about a website, including:

Crawl Overview: Number of pages found, processed, and skipped.
Website Structure: Implicitly through the list of crawled URLs and their relationships (via "Found at URL").
Link Health: Identification of broken links (404s) and redirects.
External Dependencies: List of external domains linked to or hosting resources (from Skipped URLs).
Performance Bottlenecks: Identification of the slowest loading pages and resources.
Content Inventory: Summary of different content types (HTML, images, scripts, stylesheets) and their sizes/load times.
Basic SEO Health: Status of titles, descriptions, heading structures, and indexing directives.
OpenGraph Implementation: Presence and content of OG tags for social sharing.
Server Configuration: Insights into HTTP headers used, including caching and security headers.
Caching Strategy: Effectiveness of caching policies across different content types and domains.
Security Posture: Checks for essential security headers (HSTS, X-Frame-Options, etc.).
Accessibility Issues: High-level view of common accessibility problems (missing alt text, lang attributes).
Best Practice Adherence: Checks against common web development best practices.
SSL/TLS Certificate Status: Validity and issuer details of the site's certificate.

5. Use Cases for Text Output

The text output is valuable for various tasks:

Quick Website Health Check: Get a fast overview of major issues like 404s, slow pages, or critical security/accessibility warnings.
Identifying Broken Links: Easily spot and locate 404 errors using the dedicated section.
Performance Audit: Identify the slowest URLs to prioritize optimization efforts.
Basic SEO Audit: Check for duplicate titles/descriptions and analyze heading structures.
Security Header Review: Quickly verify the presence of important security headers.
Caching Policy Verification: Understand how caching is implemented across the site.
Pre/Post Deployment Checks: Compare outputs before and after changes to catch regressions.
Generating Simple Reports: Copy-paste relevant sections into emails or documents for concise reporting.
Troubleshooting Crawl Issues: Use skipped URLs and analysis stats to understand crawler behavior.
Command-Line Integration: Process the text output with standard command-line tools (grep, awk, sed) for specific data extraction or automated checks in simple scripts.
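As one illustration of the pre/post deployment check, the sketch below extracts the 404 rows from two saved text outputs and reports the URLs that are newly broken. It assumes that each 404 row starts with the status code followed by the URL, optionally separated by a `|` character; the exact row layout of the real output is an assumption, and both helper names are hypothetical.

```python
import re

# Status 404 at the start of the row, then an absolute URL or a path.
ROW_404 = re.compile(r"^\s*404\b\s*\|?\s*(https?://\S+|/\S*)")


def extract_404_urls(report_text):
    """Collect the first URL of every row whose Status column is 404."""
    urls = set()
    for line in report_text.splitlines():
        match = ROW_404.match(line)
        if match:
            urls.add(match.group(1))
    return urls


def new_404s(before_text, after_text):
    """Return URLs that are broken after a change but were not before."""
    return sorted(extract_404_urls(after_text) - extract_404_urls(before_text))


# Hypothetical report fragments using placeholder URLs.
before = "404\thttps://example.com/a\thttps://example.com/\n"
after = before + "404\thttps://example.com/b\thttps://example.com/\n"
print(new_404s(before, after))  # ['https://example.com/b']
```

In a simple script, a non-empty result could be used to fail a deployment step.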

6. Note on JSON Output

While this document focuses on the text output, SiteOne Crawler also offers a JSON output format (--output-json-file). The JSON output contains much of the same information but in a structured format that is ideal for programmatic consumption, detailed data analysis, or integration with other tools and dashboards. For automated processing or complex data manipulation, the JSON output is generally preferred.

See the JSON Output Documentation for more details on the JSON format.
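When switching to the JSON output, a schema-agnostic first look can be useful before writing any processing code. The sketch below only lists the top-level keys and their types and makes no assumption about the report's structure; the sample string and its key names are placeholders, not the crawler's actual schema.

```python
import json


def top_level_summary(json_text):
    """Describe each top-level key of a JSON document without assuming a schema."""
    data = json.loads(json_text)
    if not isinstance(data, dict):
        return {"<root>": type(data).__name__}
    return {key: type(value).__name__ for key, value in data.items()}


# Placeholder standing in for the contents of the file written by
# --output-json-file; the key names here are invented for illustration.
example = '{"section_a": [], "section_b": {"count": 1}}'
print(top_level_summary(example))
# {'section_a': 'list', 'section_b': 'dict'}
```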
