Unifero-CLI is a compact Python toolkit that brings web search and documentation crawling into a single, easy-to-use tool. It focuses on safely extracting technical content and code snippets from result pages or documentation sites. The project provides:
- a modern CLI (main.py),
- a FastAPI wrapper (api.py) for HTTP-based automation and testing, and
- a Python class interface (tools.unifero.UniferoTool) for direct programmatic use.
- Features
- Installation
- Quick examples (CLI and API)
- Inputs & Outputs (examples)
- Edge cases, limitations & behavior
- Error handling and retry policy
- Troubleshooting
- Development & tests
- Project structure
- Search mode (DuckDuckGo) with result content extraction.
- Docs mode: crawl a base documentation URL and gather pages + code blocks.
- Code-aware extraction: preserves <pre>/<code> blocks and returns them as fenced Markdown blocks in the output.
- Multiple interfaces: CLI, HTTP API, and programmatic use.
- Networking robustness: connection retries, timeouts and basic backoff for transient failures.
- Output options: pretty JSON, compact JSON, and writing to a file.
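The code-aware extraction described above can be illustrated with a minimal standard-library sketch. This is not the project's actual parser: CodeBlockExtractor and html_to_markdown are hypothetical names, and the "language-xyz" class convention is an assumption about how pages label their code.

```python
from html.parser import HTMLParser

FENCE = "`" * 3  # Markdown code fence, built here to keep this example readable


class CodeBlockExtractor(HTMLParser):
    """Collect page text, turning each <pre> block into a fenced Markdown block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []    # finished Markdown chunks
        self._code = None  # list buffer while inside <pre>, otherwise None
        self._lang = ""

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._code, self._lang = [], ""
        elif tag == "code" and self._code is not None:
            # Assumption: the language is exposed as a "language-xyz" class.
            for cls in (dict(attrs).get("class") or "").split():
                if cls.startswith("language-"):
                    self._lang = cls[len("language-"):]

    def handle_endtag(self, tag):
        if tag == "pre" and self._code is not None:
            body = "".join(self._code).strip("\n")
            self.parts.append(f"{FENCE}{self._lang}\n{body}\n{FENCE}")
            self._code = None

    def handle_data(self, data):
        if self._code is not None:
            self._code.append(data)  # keep whitespace inside code verbatim
        elif data.strip():
            self.parts.append(data.strip())


def html_to_markdown(html):
    parser = CodeBlockExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = '<p>Intro</p><pre><code class="language-js">const a = 1;\n</code></pre>'
markdown = html_to_markdown(sample)
```

Here markdown holds the intro text followed by a js-tagged fenced block, which is the shape of the content field shown in the output examples.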
Requirements:
- Python 3.8+ (recommended)
- A virtual environment is strongly recommended
Install and set up:
cd /path/to/unifero-cli
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Tip: on macOS and Linux, activate the environment with source .venv/bin/activate (the same command works in zsh). Select the .venv interpreter in your editor (for example VS Code) to avoid "import not found" warnings.
CLI: run a quick search
source .venv/bin/activate
python3 main.py --search "Python FastAPI" --limit 3
CLI: crawl docs and save to file
python3 main.py --docs "https://ai-sdk.dev/docs/ai-sdk-ui/chatbot" --output docs_result.json
Start the API server (development):
source .venv/bin/activate
uvicorn api:app --reload
HTTP example (POST body JSON):
{
"mode": "docs",
"url": "https://ai-sdk.dev/docs/ai-sdk-ui/chatbot",
"limit": 2,
"include_content": true
}
You can POST this to http://127.0.0.1:8000/process and receive the same structure the CLI prints.
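For Python clients, the same POST can be sketched with the standard library alone. The endpoint and body come from the example above; build_process_request is a hypothetical helper name, and the network call is left commented out because it needs the development server to be running.

```python
import json
import urllib.request


def build_process_request(payload, base_url="http://127.0.0.1:8000"):
    """Build a POST request for the /process endpoint (hypothetical helper)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/process",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "mode": "docs",
    "url": "https://ai-sdk.dev/docs/ai-sdk-ui/chatbot",
    "limit": 2,
    "include_content": True,
}
req = build_process_request(payload)

# Uncomment once `uvicorn api:app --reload` is running:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     result = json.load(resp)
```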
- Search mode input (CLI):
python3 main.py --search "Next.js routing" --limit 2
Search mode JSON output (truncated):
{
"mode": "search",
"query": "Next.js routing",
"results": [
{
"title": "Next.js — Routing",
"url": "https://nextjs.org/docs/routing",
"snippet": "...routing basics...",
"content": "# Page title\nSome intro text\n```js\n// code block captured from the page\n```"
}
]
}
- Docs mode input (HTTP body):
{
"mode": "docs",
"url": "https://ai-sdk.dev/docs/ai-sdk-ui/chatbot",
"limit": 3,
"include_content": true
}
Docs mode JSON output (truncated):
{
"mode": "docs",
"base_url": "https://ai-sdk.dev/docs/ai-sdk-ui/chatbot",
"results": [
{
"url": "https://ai-sdk.dev/docs/ai-sdk-ui/chatbot",
"title": "AI SDK UI: Chatbot",
"content": "# AI SDK UI: Chatbot\nSome description...\n```js\nconst chat = useChat(...);\n```",
"fetched": true
},
{
"url": "https://ai-sdk.dev/docs/ai-sdk-ui/usage",
"title": "Usage",
"content": "...",
"fetched": true
}
]
}
Notes on output fields:
- results: list of pages (search results or crawled docs pages).
- Each result includes url, title, snippet (search mode), and content when include_content is true. content is a Markdown-ready string with fenced code blocks for extracted code.
- fetched: (docs mode) boolean indicating whether the page content was successfully fetched and parsed. If false, the error field may provide a short message.
You may also see several additional fields in the CLI/API outputs and test artifacts (for example in output.txt). These are emitted by the runner/test harness and the core tool to help clients and debugging tools interpret results:
- favicon (per-result): URL to the site's favicon, when available. Useful for UI lists where a compact site icon is shown.
- og_image (per-result): URL of the Open Graph image (og:image) if discovered in the page metadata.
- base_url (top-level for docs): the exact base URL you requested for docs mode. The tool ensures the requested base URL appears in results even if the crawler doesn't discover it.
- status_code (wrapper): the HTTP-like status code returned by the wrapper (e.g., 200 for success, 400 for invalid input). This is not the target site's HTTP code but the wrapper's response code.
- name and request (wrapper): the test runner or wrapper may produce a name label for the run and echo the request payload so you can trace which input produced the output.
- response (wrapper): when present, this contains the same structured object that the CLI/API returns (the results array, etc.).
- elapsed (wrapper): number of seconds the operation took. Useful for performance logging.
- attempts (wrapper): how many network/operation attempts were made (useful if retries occurred).
Example (truncated from a test runner):
{
"name": "search_minimal",
"request": {"mode":"search","query":"Next.js routing"},
"status_code": 200,
"response": { "query":"Next.js routing", "results": [ ... ] },
"elapsed": 1.58,
"attempts": 1
}
How to interpret these fields:
- The response object is the canonical output your client should consume. Wrapper-level metadata (name, status_code, elapsed, attempts, request) is intended for test harnesses, logging, or UI telemetry.
- Per-result favicon and og_image are optional and may be null when the page doesn't declare them or the fetch/parsing failed.
- When fetched is false for a result, you can check error for a short message; wrapper metadata still helps diagnose network timeouts or retry behavior.
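A client might apply these rules roughly as follows. This is a sketch, not part of the tool: unwrap and split_results are hypothetical helpers, and treating a missing fetched field as success is an assumption made because the search-mode examples do not show that field.

```python
def unwrap(record):
    """Return the canonical response, whether or not wrapper metadata is present."""
    return record.get("response", record)


def split_results(response):
    """Separate usable pages from failed ones.

    Assumption: a result without a 'fetched' field counts as usable.
    """
    ok, failed = [], []
    for page in response.get("results", []):
        (ok if page.get("fetched", True) else failed).append(page)
    return ok, failed


record = {
    "name": "docs_minimal",
    "status_code": 200,
    "response": {
        "mode": "docs",
        "results": [
            {"url": "https://example.com/docs", "fetched": True, "content": "# Docs"},
            {"url": "https://example.com/slow", "fetched": False, "error": "timeout after 10s"},
        ],
    },
}

ok, failed = split_results(unwrap(record))
for page in failed:
    print(f"skipped {page['url']}: {page.get('error', 'unknown error')}")
```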
- Single-page docs sites (SPA) and client-rendered content:
  - The tool fetches server-side HTML only. If a docs site is heavily client-side rendered (content injected via JavaScript), the tool will likely see only the initial shell and miss the dynamically rendered content. Use the fetched: false / error signals to detect this.
- Robots/ToS and politeness:
  - This tool does not parse robots.txt or enforce strict rate limiting. It is intended for small-scale testing. For production crawling, add robots.txt parsing, proper rate limits, and caching.
- Rate limits and blocking:
  - Repeated automated requests to the same host may trigger rate limiting or blocking. The tool uses a short retry/backoff for transient HTTP failures, but it is not stealthy: respect the target site's policies.
- Duplicate or noisy content:
  - Some page regions (headers, footers, menus) contain repeated content; the tool attempts to focus on the main <article> or visible containers but may return noise on poorly structured pages.
- Redirects and base URL normalization:
  - docs mode always includes the exact base_url requested as the first result (even if it wasn't discovered by the internal crawler). Redirects are followed by the HTTP client; results will contain the final fetched URL.
- Maximum crawl size:
  - To avoid runaway crawls, limit is capped (default 5, enforced max 10). If you need larger crawls, modify the code carefully and add rate limiting.
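The limit cap can be pictured with a small sketch. The default of 5 and the maximum of 10 come from the text above; the clamp_limit name and the lower bound of 1 are assumptions, and the tool's real enforcement may differ.

```python
DEFAULT_LIMIT = 5  # documented default
MAX_LIMIT = 10     # documented enforced maximum


def clamp_limit(limit=None):
    """Return a crawl limit within [1, MAX_LIMIT] (hypothetical helper)."""
    if limit is None:
        return DEFAULT_LIMIT
    # Assumption: values below 1 are raised to 1 rather than rejected.
    return max(1, min(int(limit), MAX_LIMIT))


requested = 50
effective = clamp_limit(requested)
```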
Overview:
- Network calls use a session with retries for transient errors (connection resets, 5xx responses). The retry policy has a small backoff and a limited number of retries.
- Timeouts are applied to HTTP requests. If a request times out, the page is marked with fetched: false and an error message.
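The retry behavior described above can be sketched generically. This is an illustration only: the retry count, backoff factor, and the fetch_page name are assumptions, and the fetch callable stands in for whatever HTTP client the tool actually uses.

```python
import time


def fetch_page(url, fetch, retries=2, backoff=0.5, sleep=time.sleep):
    """Call fetch(url), retrying transient failures with exponential backoff.

    Returns (page, attempts). On final failure the page carries
    fetched: False and a short error message instead of raising.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            content = fetch(url)
            return {"url": url, "fetched": True, "content": content}, attempts
        except (ConnectionError, TimeoutError) as exc:
            if attempts > retries:
                return {"url": url, "fetched": False, "error": str(exc)}, attempts
            sleep(backoff * 2 ** (attempts - 1))  # 0.5s, 1.0s, ...


def always_slow(url):
    # Stand-in fetcher that simulates a page which never responds in time.
    raise TimeoutError("timeout after 10s")


page, attempts = fetch_page("https://example.com/slow", always_slow, sleep=lambda s: None)
```

Passing sleep as a parameter keeps the sketch testable without real delays; the failed page has the same shape as the timeout example in this section.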
Common error fields returned in docs results (per page):
- fetched: boolean (true when parsing succeeded)
- error: short string describing the failure (network error, timeout, parse failure)
Examples:
- When a page times out:
{
"url": "https://example.com/slow",
"fetched": false,
"error": "timeout after 10s"
}
- When a page is client-rendered and contains little server HTML:
{
"url": "https://spa.example/docs",
"fetched": false,
"error": "no usable content found - page may be client-rendered"
}
How the CLI/API surfaces errors:
- The CLI exits with a non-zero exit code when the top-level operation fails (for example, missing required arguments or invalid JSON input).
- For per-page failures, the operation still returns a 200 OK with the results list containing fetched: false entries; this allows clients to inspect partial success.
- "import fastapi could not be resolved": make sure you selected the .venv interpreter in your editor and ran pip install -r requirements.txt inside the venv.
- If pytest cannot import local modules, set PYTHONPATH=. before calling pytest (or install the package into the venv).
- If extracted content lacks code blocks you expected, the page is likely client-rendered. Consider using a headless browser approach (not included) or point the tool at a direct source page that serves server-side HTML.
Run unit tests:
source .venv/bin/activate
PYTHONPATH=. pytest -q
Run the API integration script (requires the server to be running):
uvicorn api:app --reload
# In another terminal:
python3 scripts/test_api.py
unifero-cli/
├── assets/ # small assets (logo.svg)
├── main.py # CLI entrypoint
├── api.py # FastAPI wrapper
├── requirements.txt # dependencies
├── tools/
│ ├── __init__.py
│ └── unifero.py # core logic
├── tests/
│ └── test_main.py
└── scripts/
└── test_api.py
Contributions welcome. Please include tests for bug fixes or new features. Keep UniferoTool.process_request contract stable if you rely on it from the CLI or API.
MIT-style (open source). Use respectfully and add tests for changes.
A powerful CLI toolkit for web searches and documentation crawling with enhanced code extraction capabilities.
- Smart Web Search: DuckDuckGo-based search with content extraction from result pages
- Documentation Crawling: Crawl documentation sites and extract structured content
- Code Extraction: Enhanced HTML parsing specifically designed to capture code snippets and technical content
- Multiple Interfaces: Modern CLI, legacy JSON input, REST API, and Python library
- Robust Networking: Built-in retries, timeout handling, and error recovery
- Flexible Output: Pretty JSON, compact JSON, or file output
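The three output options map naturally onto the standard json module. The sketch below is an assumption about how such options could be implemented, not the CLI's actual code; render and write_result are hypothetical names.

```python
import json
from pathlib import Path


def render(result, compact=False):
    """Serialize a result as pretty (indented) or compact JSON."""
    if compact:
        return json.dumps(result, separators=(",", ":"), ensure_ascii=False)
    return json.dumps(result, indent=2, ensure_ascii=False)


def write_result(result, path, compact=False):
    """Write the serialized result to a file and return the path."""
    out = Path(path)
    out.write_text(render(result, compact=compact), encoding="utf-8")
    return out


result = {"mode": "search", "results": []}
pretty = render(result)
compact = render(result, compact=True)
```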
Requirements:
- Python 3.8+
- Virtual environment recommended
Setup:
# Clone or download the project
cd unifero-cli
# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Activate environment
source .venv/bin/activate
# Quick search
python3 main.py --search "Next.js routing"
# Documentation crawl with code extraction
python3 main.py --docs "https://ai-sdk.dev/docs/ai-sdk-ui/chatbot"
# Save results to file
python3 main.py --search "Python FastAPI" --output results.json
# Show all examples
python3 main.py --examples
The enhanced CLI supports intuitive command-line arguments:
Search mode:
# Basic search
python3 main.py --search "Next.js routing"
# Advanced search with options
python3 main.py --search "React hooks" --limit 5 --snippet-len 200 --content-len 3000
# Compact output
python3 main.py --search "Python FastAPI" --compact
# Save to file
python3 main.py --search "Vue.js components" --output search_results.json
Docs mode:
# Basic docs crawl
python3 main.py --docs "https://ai-sdk.dev/docs/ai-sdk-ui/chatbot"
# Advanced docs with options
python3 main.py --docs "https://nextjs.org/docs" --limit 3 --content-limit 2000
# Docs without content (URLs only)
python3 main.py --docs "https://example.com/docs" --no-content
Help and examples:
# Show help
python3 main.py --help
# Show all examples
python3 main.py --examples
For backward compatibility, JSON input is still supported:
# JSON as argument
python3 main.py '{"mode":"search","query":"Next.js routing","limit":3}'
# JSON via environment variable
export UNIFERO_JSON='{"mode":"docs","url":"https://example.com/docs"}'
python3 main.py
# JSON via pipe
echo '{"mode":"search","query":"test"}' | python3 main.py
Use the UniferoTool class directly from Python code:
from tools.unifero import UniferoTool
tool = UniferoTool()
resp = tool.process_request({
"mode": "search",
"query": "Next.js routing",
"limit": 2
})
print(resp)
The process_request method accepts a dict with these keys:
- mode: search (default) or docs
- query: search query (required for search)
- limit: maximum number of results
- url: base URL for docs mode
- include_content: whether to fetch page content for docs
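Before handing a dict to process_request, a caller could check it against the keys listed above. The sketch below is a hypothetical pre-flight check, not part of UniferoTool; it assumes only what this README documents (search as the default mode, query required for search, url required for docs).

```python
VALID_MODES = {"search", "docs"}


def validate_request(req):
    """Return a list of problems; an empty list means the request looks well-formed."""
    problems = []
    mode = req.get("mode", "search")  # search is the documented default
    if mode not in VALID_MODES:
        problems.append(f"unknown mode: {mode!r}")
    elif mode == "search" and not req.get("query"):
        problems.append("search mode requires 'query'")
    elif mode == "docs" and not req.get("url"):
        problems.append("docs mode requires 'url'")
    return problems


good = validate_request({"mode": "docs", "url": "https://example.com/docs", "limit": 2})
bad = validate_request({"mode": "docs"})
```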
Performs DuckDuckGo search and extracts content from result pages.
Parameters:
- query (required): Search query string
- limit: Maximum number of results (default: 5)
- snippet_len: Maximum snippet length (default: 300)
- content_len: Maximum content length (default: 2000)
Crawls documentation sites and extracts structured content with code blocks.
Parameters:
- url (required): Base documentation URL
- limit: Maximum pages to crawl (default: 5, max: 10)
- include_content: Whether to fetch page content (default: true)
- content_limit: Maximum content length per page (default: 2000)
source .venv/bin/activate
PYTHONPATH=. pytest -q
A comprehensive test suite is available for the FastAPI server:
# Start the API server
uvicorn api:app --reload
# In another terminal, run the test suite
python3 scripts/test_api.py
This project includes configuration for easy deployment to Vercel as a serverless FastAPI application.
- A Vercel account
- Vercel CLI installed: npm i -g vercel
- Your project pushed to a Git repository (GitHub, GitLab, or Bitbucket)
- Test locally first:
source .venv/bin/activate
./deploy.sh
This will start a local development server at http://localhost:8000
- Deploy to Vercel:
# Login to Vercel (one time setup)
vercel login
# Deploy from your project directory
vercel
# For production deployment
vercel --prod
The following files configure Vercel deployment:
- vercel.json: Main Vercel configuration
- runtime.txt: Specifies Python version (3.11)
- .vercelignore: Files to exclude from deployment
- requirements.txt: Python dependencies
Once deployed, your API will have these endpoints:
- GET /health: Health check endpoint
- POST /process: Main API endpoint for processing requests
- GET /docs: FastAPI auto-generated documentation
- GET /redoc: Alternative API documentation
After deployment, you can use your API like this:
# Health check
curl https://your-app.vercel.app/health
# Search request
curl -X POST https://your-app.vercel.app/process \
-H "Content-Type: application/json" \
-d '{"mode":"search","query":"Next.js routing","limit":3}'
# Docs request
curl -X POST https://your-app.vercel.app/process \
-H "Content-Type: application/json" \
-d '{"mode":"docs","url":"https://nextjs.org/docs","limit":2}'
If your application needs environment variables, you can set them in the Vercel dashboard or via CLI:
vercel env add VARIABLE_NAME
- Import errors: Make sure all dependencies are listed in requirements.txt
- Timeout issues: Vercel limits how long a serverless function may run (10 seconds here; the exact limit depends on your plan), so keep limit and content limits small
- Memory issues: Consider reducing content limits for large documents
- Module not found: Ensure proper Python path structure
Before deploying, always test locally:
# Start local development server
./deploy.sh
# Test endpoints
curl http://localhost:8000/health
curl -X POST http://localhost:8000/process \
-H "Content-Type: application/json" \
-d '{"mode":"search","query":"test","limit":1}'
unifero-cli/
├── main.py # Enhanced CLI interface
├── api.py # FastAPI server wrapper
├── requirements.txt # Python dependencies
├── tools/
│ ├── __init__.py # Package initialization
│ └── unifero.py # Core extraction logic
├── tests/
│ └── test_main.py # Unit tests
└── scripts/
└── test_api.py # API integration tests
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Ensure all tests pass: PYTHONPATH=. pytest
- Submit a pull request
Open source - contributions welcome. Keep changes focused and add tests for new functionality.
