Commit 339fd72

docs(root): Update README
1 parent da5152d commit 339fd72

3 files changed

Lines changed: 203 additions & 1 deletion

.github/FUNDING.yml

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
github: sudosubin

README.md

Lines changed: 202 additions & 1 deletion
@@ -1 +1,202 @@
1-
# commoncrawl.cc
1+
<h1 align="center">commoncrawl.cc</h1>

<p align="center">
A search-focused web console and API proxy for exploring Common Crawl index data.
</p>

<p align="center">
<a href="https://github.com/sudosubin/commoncrawl.cc/actions/workflows/ci.yml"><img alt="CI" src="https://github.com/sudosubin/commoncrawl.cc/actions/workflows/ci.yml/badge.svg" /></a>
<a href="https://commoncrawl.cc"><img alt="Website" src="https://img.shields.io/badge/site-commoncrawl.cc-0f172a?style=flat&logo=cloudflarepages&logoColor=f38020" /></a>
<a href="https://api.commoncrawl.cc/openapi.json"><img alt="OpenAPI" src="https://img.shields.io/badge/api-openapi.json-0369a1?style=flat&logo=openapiinitiative&logoColor=white" /></a>
<a href="./LICENSE"><img alt="License" src="https://img.shields.io/badge/license-MIT-16a34a.svg" /></a>
<a href="https://github.com/sponsors/sudosubin"><img alt="GitHub Sponsors" src="https://img.shields.io/badge/sponsor-GitHub%20Sponsors-ea4aaa?style=flat&logo=githubsponsors&logoColor=white" /></a>
</p>

commoncrawl.cc makes Common Crawl index data easier to explore from the browser.
It combines a fast web UI with a typed API proxy so you can inspect captures, timelines,
and raw responses without manually stitching together index endpoints.

<p align="center">
<a href="https://commoncrawl.cc/search?url=github.blog%2F*">
<img alt="commoncrawl.cc" src="./assets/readme/commoncrawl-search-github-blog-polished.png" />
</a>
</p>

<p align="center">
Example search workspace exploring <code>github.blog/*</code> snapshots, timeline metadata, and capture inspection.
</p>

## Why this project exists

Common Crawl is incredibly useful, but its index APIs are still fairly low-level for day-to-day exploration.
commoncrawl.cc aims to provide a cleaner workflow for developers, researchers, SEO teams, archivists,
and data engineers who need to:

- search snapshot history for a URL
- inspect capture timelines
- fetch raw capture responses
- experiment from a browser instead of ad-hoc scripts
- build against a typed OpenAPI surface
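
To make "low-level" concrete, here is a minimal sketch of querying the upstream index directly, without commoncrawl.cc. It assumes the public CDX-style endpoint at `index.commoncrawl.org`, which takes a crawl identifier in the path and returns newline-delimited JSON when `output=json` is set. The function names and the `CaptureRecord` type are illustrative and not part of this project.

```typescript
// One capture record from the Common Crawl index (output=json).
// Only a few commonly present fields are typed; real records carry more.
interface CaptureRecord {
  urlkey: string;
  timestamp: string; // 14-digit UTC timestamp, e.g. "20240301123456"
  url: string;
  status?: string;
  mime?: string;
  filename?: string;
  offset?: string;
  length?: string;
}

// Build a query URL for a single crawl's index. `crawlId` is an identifier
// such as those listed at https://index.commoncrawl.org/collinfo.json.
function buildIndexQuery(crawlId: string, urlPattern: string, limit = 10): string {
  const query = new URLSearchParams({
    url: urlPattern,
    output: "json",
    limit: String(limit),
  });
  return `https://index.commoncrawl.org/${crawlId}-index?${query.toString()}`;
}

// The index answers with newline-delimited JSON: one object per line.
function parseCaptures(body: string): CaptureRecord[] {
  return body
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as CaptureRecord);
}

// Fetch and parse captures for one crawl. Searching several crawls means
// repeating this per crawl identifier and merging the results yourself.
async function searchCaptures(crawlId: string, urlPattern: string): Promise<CaptureRecord[]> {
  const response = await fetch(buildIndexQuery(crawlId, urlPattern));
  if (!response.ok) {
    throw new Error(`Index request failed with status ${response.status}`);
  }
  return parseCaptures(await response.text());
}
```

For example, `searchCaptures("CC-MAIN-2024-10", "github.blog/*")` would query a single crawl. Picking crawl identifiers, merging results across crawls, and building a timeline are left to the caller, which is the gap this project aims to close.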

## Features

- Search-focused UI for Common Crawl index exploration
- Snapshot, timeline, and capture inspection workflows
- Raw response preview for capture debugging
- Cloudflare Worker API proxy for `index.commoncrawl.org`
- Generated OpenAPI spec and typed web client
- MSW-backed local mocking for frontend development
- Cloudflare-based deployment workflow for API and web

## Live endpoints

- Web: https://commoncrawl.cc
- API: https://api.commoncrawl.cc
- OpenAPI: https://api.commoncrawl.cc/openapi.json
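
One quick way to see what the API exposes is to read the OpenAPI document and list its operations. The sketch below assumes only that the document follows the standard OpenAPI 3 `paths` layout. It does not assume any particular route names, and the helper names are illustrative.

```typescript
// Minimal shape of the parts of an OpenAPI 3 document used below.
interface OpenApiDocument {
  paths?: Record<string, Record<string, unknown>>;
}

// HTTP methods that OpenAPI 3 allows as operation keys on a path item.
const HTTP_METHODS = ["get", "put", "post", "delete", "options", "head", "patch", "trace"];

// Flatten `paths` into "METHOD /path" strings, skipping non-operation keys
// such as `parameters` or `summary` on a path item.
function listOperations(spec: OpenApiDocument): string[] {
  const operations: string[] = [];
  for (const [path, item] of Object.entries(spec.paths ?? {})) {
    for (const method of Object.keys(item)) {
      if (HTTP_METHODS.includes(method.toLowerCase())) {
        operations.push(`${method.toUpperCase()} ${path}`);
      }
    }
  }
  return operations;
}

// Fetch the published spec and list its operations.
async function fetchOperations(
  specUrl = "https://api.commoncrawl.cc/openapi.json",
): Promise<string[]> {
  const response = await fetch(specUrl);
  if (!response.ok) {
    throw new Error(`Spec request failed with status ${response.status}`);
  }
  return listOperations((await response.json()) as OpenApiDocument);
}
```

Calling `fetchOperations()` against the live endpoint prints nothing by itself. Log the returned array to see the current route list.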

## Sponsors

commoncrawl.cc is maintained as an independent open source project.
Sponsorship helps fund ongoing maintenance, UX improvements, API hardening, documentation,
and the time required to keep the project useful and free for the community.

If your company uses Common Crawl for search, SEO, archival, research, data enrichment,
or LLM pipelines, sponsoring this project is a practical way to support the tooling around that ecosystem.

<p align="center">
<a href="https://github.com/sponsors/sudosubin">
<img alt="Sponsor commoncrawl.cc" src="https://img.shields.io/badge/Sponsor%20commoncrawl.cc-GitHub%20Sponsors-ea4aaa?style=for-the-badge&logo=githubsponsors&logoColor=white" />
</a>
</p>

> No sponsors yet. Your company can become the founding sponsor.

### Sponsor visibility

<table>
<tr>
<td align="center" width="33%">
<a href="https://github.com/sponsors/sudosubin">
<img alt="Founding sponsor slot" src="https://img.shields.io/badge/Founding%20Sponsor-Your%20logo%20here-111827?style=for-the-badge" />
</a>
<br />
Top README placement
</td>
<td align="center" width="33%">
<a href="https://github.com/sponsors/sudosubin">
<img alt="Project sponsor slot" src="https://img.shields.io/badge/Project%20Sponsor-Your%20logo%20here-1f2937?style=for-the-badge" />
</a>
<br />
Sponsor section placement
</td>
<td align="center" width="33%">
<a href="https://github.com/sponsors/sudosubin">
<img alt="Community sponsor slot" src="https://img.shields.io/badge/Community%20Sponsor-Your%20name%20here-374151?style=for-the-badge" />
</a>
<br />
Acknowledgement and support
</td>
</tr>
</table>

A dedicated sponsor kit with tiers, logo guidelines, and company contact details can be added as the sponsorship program evolves.

## Packages

- [`packages/web`](./packages/web/README.md): Preact + Vite frontend for search and capture exploration
- [`packages/api`](./packages/api/README.md): Cloudflare Worker proxy and OpenAPI source

## Architecture

```text
Browser UI (packages/web)
  -> API proxy (packages/api)
    -> index.commoncrawl.org
```

The web app consumes generated API clients based on the Worker's exported OpenAPI spec.
That keeps the frontend and proxy contract aligned.
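
The middle hop of that diagram can be sketched as a Worker-style `fetch` handler. This is a simplified illustration under stated assumptions, not the code in `packages/api`: the actual proxy is built with Hono and exposes typed routes, whereas this sketch forwards the path and query string unchanged. The names `toUpstreamUrl` and `worker` are hypothetical.

```typescript
const UPSTREAM_ORIGIN = "https://index.commoncrawl.org";

// Map an incoming proxy URL onto the upstream index, keeping path and query.
function toUpstreamUrl(incomingUrl: string): string {
  const incoming = new URL(incomingUrl);
  return new URL(incoming.pathname + incoming.search, UPSTREAM_ORIGIN).toString();
}

// Worker-style handler built only on the standard Request/Response APIs.
// In a Cloudflare Worker module this object would be the default export.
const worker = {
  async fetch(request: Request): Promise<Response> {
    // A read-only proxy only needs GET in this sketch.
    if (request.method !== "GET") {
      return new Response("Method not allowed", { status: 405 });
    }

    const upstream = await fetch(toUpstreamUrl(request.url));

    // Pass the content type through and allow browser clients to call the proxy.
    const headers = new Headers();
    const contentType = upstream.headers.get("Content-Type");
    if (contentType) {
      headers.set("Content-Type", contentType);
    }
    headers.set("Access-Control-Allow-Origin", "*");

    return new Response(upstream.body, { status: upstream.status, headers });
  },
};
```

A pass-through like this solves the browser's cross-origin problem but gives the frontend no typed contract, which is why the project generates its client from an OpenAPI spec instead.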

## Quick start

### 1) Install dependencies

```bash
pnpm install
```

### 2) Configure the web app

```bash
cp packages/web/.env.example packages/web/.env
```

### 3) Start the API

```bash
pnpm --filter @commoncrawl.cc/api dev
```

### 4) Start the web app

```bash
pnpm --filter @commoncrawl.cc/web dev
```

Then open:

- http://localhost:3000

The web app expects the API at `http://localhost:8787` by default.
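
The default-plus-override behavior described above can be sketched as follows. The variable name `API_BASE_URL` is hypothetical and used only for illustration. Check `packages/web/.env.example` for the names the web app actually reads.

```typescript
const DEFAULT_API_BASE_URL = "http://localhost:8787";

// Resolve the API base URL from an environment-like object.
// `API_BASE_URL` is a hypothetical name, not taken from the project.
function resolveApiBaseUrl(env: Record<string, string | undefined>): string {
  const configured = env["API_BASE_URL"]?.trim();
  const base = configured && configured.length > 0 ? configured : DEFAULT_API_BASE_URL;
  // Drop trailing slashes so callers can append paths without doubling them.
  return base.replace(/\/+$/, "");
}
```

With no override, `resolveApiBaseUrl({})` falls back to the local API started in step 3. Setting the variable points the same build at another deployment.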

## Development

### Build

```bash
pnpm --filter @commoncrawl.cc/api build
pnpm --filter @commoncrawl.cc/web build
```

### Test

```bash
pnpm --filter @commoncrawl.cc/web test
```

### Lint and format

```bash
pnpm lint
pnpm fmt:check
```

### Sync OpenAPI artifacts

```bash
pnpm openapi:sync
```

This exports the API OpenAPI spec and regenerates the typed web client.

## Tech stack

- Preact
- Vite
- preact-iso
- Hono
- Cloudflare Workers
- Cloudflare Pages
- Orval
- MSW
- pnpm workspace

## Contributing

Issues and pull requests are welcome.
If you find rough edges in the search workflow, timeline view, replay behavior,
or API contract, feedback is especially valuable.

## License

[MIT](./LICENSE)