GitHub Repository Summarizer API

This is a Flask-based REST API application that leverages Large Language Models (LLMs) via LangChain to automatically analyze and summarize any public GitHub repository.

The application scans the project structure, reads the README.md, uses AI to select the most important source code files, and generates a structured JSON response describing the project, its technology stack, and its architecture based on those files.

Features

  • Smart file selection: The LLM analyzes the file tree and selects up to 6 key files to understand the project's logic, ignoring binaries and clutter.
  • Deep analysis: Reads the contents of the selected files (truncated to a size limit so the prompt stays within the context window) and builds the description from them.
  • Structured output: Uses LangChain's JsonOutputParser with a Pydantic schema to request and parse a JSON response in a fixed shape.
  • Logging: Detailed execution logs (including LLM prompts and responses) are saved to app.log.
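
The size limit mentioned under "Deep analysis" could be applied roughly as follows. This is a minimal sketch, not the app's actual code: the helper name truncate_content and the limit MAX_FILE_CHARS = 8000 are assumptions made for illustration.

```python
MAX_FILE_CHARS = 8000  # hypothetical limit; the value the app uses is not stated here


def truncate_content(text: str, limit: int = MAX_FILE_CHARS) -> str:
    """Return text cut to at most `limit` characters, with a marker if it was cut."""
    if len(text) <= limit:
        return text
    # Keep the head of the file: imports and top-level definitions usually live there.
    return text[:limit] + "\n... [truncated]"
```

Cutting by characters is only a rough proxy for tokens, so a real limit would likely be chosen with some headroom.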

Requirements

  • Python 3.9+
  • GitHub Personal Access Token (to raise the GitHub API rate limit above the unauthenticated quota)
  • LLM Provider API Key (configured for Nebius API with Llama 3.1 / Qwen models in this setup)

Installation and Setup

  1. Clone the repository (or create a project folder):
   git clone 
   cd 
  2. Create and activate a virtual environment:
   python -m venv venv
   source venv/bin/activate  # On Windows use: venv\Scripts\activate
  3. Install dependencies:
   pip install flask requests python-dotenv langchain-openai pydantic langchain-core gunicorn
  4. Configure environment variables: Create a .env file in the root of the project and add your keys:
   NEBIUS_API_KEY=your_nebius_api_key
   GITHUB_TOKEN=your_github_token
  5. Start the server with Gunicorn:
   gunicorn app:app -w 4 -b 0.0.0.0:8000

The server listens on all network interfaces on port 8000; on the same machine it is reachable at http://localhost:8000.

Gunicorn parameters:

  • -w 4 — number of worker processes (adjust based on CPU cores)
  • -b 0.0.0.0:8000 — bind to all interfaces on port 8000
  • app:app — module name and Flask app object
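
The same settings can also live in a Gunicorn config file instead of command-line flags. The sketch below assumes a file named gunicorn.conf.py in the project root and replaces the fixed -w 4 with the (2 x cores) + 1 rule of thumb from the Gunicorn documentation; this is a suggestion, not the project's own configuration.

```python
# gunicorn.conf.py (assumed file name and location: project root)
import multiprocessing

# Bind to all interfaces on port 8000, same as `-b 0.0.0.0:8000`.
bind = "0.0.0.0:8000"

# Rule of thumb from the Gunicorn docs: (2 x CPU cores) + 1 worker processes.
workers = multiprocessing.cpu_count() * 2 + 1
```

With this file in place, the start command would become: gunicorn -c gunicorn.conf.py app:app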

API Documentation

1. Get Repository Summary

Endpoint: POST /summarize

Accepts a GitHub repository URL, fetches the repository contents, and returns a summary generated by an LLM.

Request body:

{
  "github_url": "https://github.com/psf/requests"
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| github_url | string | Yes | URL of a public GitHub repository |
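
A request with a missing or malformed github_url is rejected with 400 Bad Request. One way such a check might look is sketched below; the helper name parse_github_url and its exact rules are hypothetical, since the app's own validation is not shown in this README.

```python
from urllib.parse import urlparse


def parse_github_url(url: str) -> tuple[str, str]:
    """Split a GitHub repository URL into (owner, repo) or raise ValueError."""
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        raise ValueError("URL must start with http:// or https://")
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError("URL must point to github.com")
    # Path looks like /owner/repo[/...]; ignore empty segments from extra slashes.
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        raise ValueError("URL must include both owner and repository")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    if not repo:
        raise ValueError("repository name is empty")
    return owner, repo
```

In a Flask handler, a ValueError raised here would be the natural place to return the 400 error body.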

Response (200 OK):

{
  "summary": "Requests is a popular Python library for making HTTP requests...",
  "technologies": ["Python", "urllib3", "certifi"],
  "structure": "The project follows a standard Python package layout with the main source code in src/requests/, tests in tests/, and documentation in docs/."
}

| Field | Type | Description |
| --- | --- | --- |
| summary | string | A human-readable description of what the project does |
| technologies | string[] | List of main technologies, languages, and frameworks used |
| structure | string | Brief description of the project structure and organization |
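
A client that consumes this endpoint may want to check the response shape before using it. The following is a small standard-library sketch under that assumption; check_summary_payload is a hypothetical helper and is not part of the API.

```python
def check_summary_payload(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the documented shape is met."""
    problems = []
    # `summary` and `structure` are documented as plain strings.
    for key in ("summary", "structure"):
        if not isinstance(payload.get(key), str):
            problems.append(f"{key} must be a string")
    # `technologies` is documented as a list of strings.
    techs = payload.get("technologies")
    if not isinstance(techs, list) or not all(isinstance(t, str) for t in techs):
        problems.append("technologies must be a list of strings")
    return problems
```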

Error Response:

{
  "status": "error",
  "message": "Description of what went wrong"
}

Possible Errors:

  • 400 Bad Request: Missing URL or invalid URL format.
  • 502 Bad Gateway: Error communicating with the GitHub API (e.g., repository not found).
  • 500 Internal Server Error: Error generating or parsing the LLM response.
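
The mapping from failure type to status code could be organized as in the sketch below. The exception class names and the error_response helper are hypothetical; only the status codes and the error body shape come from this README.

```python
class InvalidRequestError(Exception):
    """Missing or malformed github_url (hypothetical name)."""


class GitHubAPIError(Exception):
    """Failure while talking to the GitHub API (hypothetical name)."""


class LLMError(Exception):
    """Failure generating or parsing the LLM response (hypothetical name)."""


STATUS_BY_ERROR = {
    InvalidRequestError: 400,
    GitHubAPIError: 502,
    LLMError: 500,
}


def error_response(exc: Exception) -> tuple[dict, int]:
    """Build the documented error body and pick the matching HTTP status."""
    status = 500  # anything unexpected is treated as an internal error
    for exc_type, code in STATUS_BY_ERROR.items():
        if isinstance(exc, exc_type):
            status = code
            break
    return {"status": "error", "message": str(exc)}, status
```

A Flask view could return jsonify(body), status from the pair produced here.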

2. Server Health Check

Endpoint: GET /health

Returns the server status. Used for monitoring.

Response:

{
  "status": "ok"
}
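
For monitoring, a probe only needs to fetch /health and confirm the status field. A minimal standard-library sketch follows; the function names and the base URL passed in are assumptions, not part of the project.

```python
import json
import urllib.request


def fetch_health(base_url: str, timeout: float = 5.0) -> str:
    """GET <base_url>/health and return the raw response body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def is_healthy(body: str) -> bool:
    """True only if the body is a JSON object whose status is "ok"."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("status") == "ok"
```

Against a locally running server this would be used as is_healthy(fetch_health("http://localhost:8000")).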

How it works under the hood (Architecture)

  1. analyze_repo_structure: Fetches the repository file tree (up to depth 2) and the README.md, then sends both to the LLM with a prompt to return a JSON array of up to 6 of the most important files (configs, entry points, core logic).
  2. generate_repo_summary: Downloads the source code of the selected files (up to 6) via the GitHub API. The full context (tree, README, and code) is then sent in a second LLM call with a strict Pydantic response schema (RepoSummarySchema) to produce the final result.
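
The first step depends on two small pieces of plain logic: trimming the tree to depth 2 and turning the LLM's reply into a safe list of at most 6 paths. The sketch below illustrates both under stated assumptions: the helper names are hypothetical, "depth" is taken to mean the number of path components, and the real functions may work differently.

```python
import json

MAX_DEPTH = 2  # tree depth from the description above
MAX_FILES = 6  # file cap from the description above


def limit_depth(paths, max_depth=MAX_DEPTH):
    """Keep paths with at most `max_depth` components ('src/app.py' has 2)."""
    return [p for p in paths if len(p.strip("/").split("/")) <= max_depth]


def parse_selected_files(llm_text, known_paths, max_files=MAX_FILES):
    """Parse the LLM's JSON array, drop unknown or repeated paths, cap the count."""
    try:
        candidates = json.loads(llm_text)
    except json.JSONDecodeError as exc:
        raise ValueError("LLM did not return valid JSON") from exc
    if not isinstance(candidates, list):
        raise ValueError("expected a JSON array of file paths")
    known = set(known_paths)
    selected = []
    for path in candidates:
        # Ignore paths the model invented and anything listed twice.
        if isinstance(path, str) and path in known and path not in selected:
            selected.append(path)
    return selected[:max_files]
```

Filtering against the known tree keeps the second step from requesting files that do not exist in the repository.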