A web search engine built with FastAPI. Koala allows you to crawl websites, index their content, and perform semantic search with relevance scoring.
- Semantic Search: Advanced search using sentence transformers for better relevance
- Web Crawling: Automated website crawling with configurable depth and page limits
- Real-time Analytics: Search statistics and popular query tracking
- RESTful API: Full-featured API for integration with other applications
- Background Processing: Non-blocking website crawling with job status tracking
- Python 3.10 or higher
- UV (recommended)
- Clone the repository:

  ```bash
  git clone https://github.com/Abhinavexists/Koala.git
  cd koala
  ```

- Install backend dependencies:

  ```bash
  cd backend
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  pip install -r requirements.txt
  ```

  Recommended: install with `uv` (faster, more reliable):

  ```bash
  cd backend
  uv venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  uv pip install -e .
  ```
- Start the static server:

  ```bash
  python static_server.py
  ```

- Start the search API:

  ```bash
  python search_api.py
  ```

- Open in a browser: navigate to `http://localhost:8000`
- Go to the Websites tab
- Fill in the website details:
  - URL: The website to crawl (e.g., `https://example.com`)
  - Name: Display name for the website
  - Description: Optional description
  - Max Pages: Maximum number of pages to crawl (default: 50)
  - Max Depth: How deep to crawl from the starting URL (default: 2)
- Click Add Website
- The system will start crawling in the background
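Websites can also be registered programmatically through the API. A minimal sketch using Python's `requests` package, assuming the search API is reachable at `http://localhost:8080` as in the curl examples below:

```python
import requests

# Register a website for crawling via the API (illustrative values).
payload = {
    "url": "https://example.com",
    "name": "Example Site",
    "max_pages": 50,
    "max_depth": 2,
}
response = requests.post("http://localhost:8080/api/websites", json=payload)
response.raise_for_status()
print(response.json())  # the API's response, e.g. crawl job status
```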
- Go to the Search tab
- Enter your search query in the search box
- Press Enter or click the Search button
- View results with relevance scores and snippets
- Use pagination to browse through results
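The same search can be issued directly against the API. A sketch under the same assumptions (API at `http://localhost:8080`, `requests` installed):

```python
import requests

# Query the search endpoint with pagination parameters.
params = {"q": "python", "page": 1, "per_page": 10}
response = requests.get("http://localhost:8080/api/search", params=params)
response.raise_for_status()
print(response.json())  # relevance-scored results; exact response shape depends on the API
```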
- Go to the Analytics tab to view:
- Total number of searches performed
- Number of websites indexed
- Active crawling jobs
- Popular search queries
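These figures are also exposed through the API (`/api/stats` and `/api/popular`). A short sketch, again assuming the API runs at `http://localhost:8080`:

```python
import requests

BASE = "http://localhost:8080"

# Fetch overall system statistics and the most popular queries.
stats = requests.get(f"{BASE}/api/stats").json()
popular = requests.get(f"{BASE}/api/popular").json()

print(stats)
print(popular)
```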
- `GET /api/search` - Perform search queries
- `GET /api/websites` - List all websites
- `POST /api/websites` - Add new website to crawl
- `DELETE /api/websites/{id}` - Remove website
- `POST /api/websites/{id}/recrawl` - Recrawl website
- `GET /api/popular` - Get popular search queries
- `GET /api/stats` - Get system statistics
```bash
# Search for content
curl "http://localhost:8080/api/search?q=python&page=1&per_page=10"

# Add a website
curl -X POST "http://localhost:8080/api/websites" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "name": "Example Site",
    "max_pages": 100,
    "max_depth": 3
  }'
```

- search_api.py: Main API endpoints and business logic
- search_engine.py: Semantic search implementation using sentence transformers
- crawler.py: Web crawling functionality with BeautifulSoup (see the sketch after this list)
- static_server.py: Combined static file server and API gateway
- prepared_data.json: Indexed website content and metadata
- websites.json: Website configuration and crawl status
- Vector Index: In-memory semantic search index using FAISS
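To make the crawler component concrete, here is a minimal sketch of a bounded breadth-first crawl in the spirit of crawler.py; it is not the project's actual implementation and assumes `requests` and `beautifulsoup4` are installed:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(start_url: str, max_pages: int = 50, max_depth: int = 2) -> dict[str, str]:
    """Fetch up to max_pages same-domain pages within max_depth links of start_url."""
    pages: dict[str, str] = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    domain = urlparse(start_url).netloc

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        pages[url] = soup.get_text(" ", strip=True)  # plain text to feed the index
        if depth < max_depth:
            for link in soup.find_all("a", href=True):
                next_url = urljoin(url, link["href"])
                if urlparse(next_url).netloc == domain and next_url not in seen:
                    seen.add(next_url)
                    queue.append((next_url, depth + 1))
    return pages
```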
- `HOST`: Server host (default: `0.0.0.0`)
- `PORT`: Server port (default: `8080`)
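For illustration only, the usual pattern for picking these up from the environment (the exact lookup inside `static_server.py`/`search_api.py` may differ):

```python
import os

# Fall back to the documented defaults when the variables are unset.
HOST = os.getenv("HOST", "0.0.0.0")
PORT = int(os.getenv("PORT", "8080"))
```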
- Model: Uses the `all-MiniLM-L6-v2` sentence transformer model
- Device: Automatically detects CUDA/CPU
- Index Type: FAISS for efficient similarity search
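As a rough illustration of how these pieces fit together (not the actual search_engine.py code), assuming `sentence-transformers` and `faiss` are installed:

```python
import faiss
from sentence_transformers import SentenceTransformer

# Encode documents with the model named above and build an in-memory FAISS index.
model = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
    "Koala crawls websites and indexes their content.",
    "FastAPI makes it easy to build web APIs in Python.",
]
embeddings = model.encode(documents, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on normalized vectors
index.add(embeddings)

# Encode a query and retrieve the most relevant documents with scores.
query = model.encode(["how do I build an API?"], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 2)
for score, doc_id in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {documents[doc_id]}")
```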
```
koala/
├── frontend/            # Frontend assets
│   ├── css/             # Stylesheets
│   ├── js/              # JavaScript modules
│   └── index.html       # Main HTML file
├── search_api.py        # API endpoints
├── search_engine.py     # Search implementation
├── crawler.py           # Web crawler
├── static_server.py     # Server entry point
├── pyproject.toml       # Project info
└── requirements.txt     # Python dependencies
```
- Backend only (API on port 8000):

  ```bash
  python search_api.py
  ```
This project is licensed under the MIT License - see the LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
(Just a fun practice project to gain an understanding of Redis, crawlers, and semantic search)