Star 🌟 CocoIndex if you like it!!
This example shows how to use instructor with Gemini to analyze multiple Python codebases and generate markdown documentation using CocoIndex v1.
- Scans subdirectories of a root directory (each expected to be a separate Python project)
- Per-file extraction using LLM with a unified CodebaseInfo model:
  - Public classes and functions with functionality summaries
  - CocoIndex app call relationship graphs (Mermaid format)
  - File-level summaries
- Project aggregation - combines file-level CodebaseInfo into a project-level summary
- Outputs markdown documentation to output/PROJECT_NAME.md
- Instructor Integration: Uses instructor library for structured LLM outputs with Pydantic
- Unified Data Model: Same CodebaseInfo type for both file-level and project-level extraction
- LLM-Generated Mermaid Graphs: The LLM generates mermaid syntax directly with:
  - Bold text for @coco.fn decorated functions
  - Thick arrows (==>) for mount/use_mount calls
- Incremental Processing: CocoIndex handles caching - only re-processes changed files
- Multi-Project Support: Processes multiple codebases in parallel
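The incremental behavior can be pictured as content-based memoization. The sketch below is only a conceptual illustration and not CocoIndex's actual mechanism: `FingerprintCache` and its methods are hypothetical names, and the wrapped function stands in for the per-file LLM call.

```python
import hashlib
from typing import Callable, Dict, Tuple


class FingerprintCache:
    """Hypothetical cache: re-run `fn` for a file only when its content changes."""

    def __init__(self, fn: Callable[[str], str]) -> None:
        self._fn = fn
        # path -> (content digest, cached result)
        self._store: Dict[str, Tuple[str, str]] = {}
        self.calls = 0  # number of times the expensive function actually ran

    def process(self, path: str, content: str) -> str:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        cached = self._store.get(path)
        if cached is not None and cached[0] == digest:
            # Unchanged file: reuse the previous result.
            return cached[1]
        self.calls += 1
        result = self._fn(content)
        self._store[path] = (digest, result)
        return result
```

With this shape, processing the same file twice with identical content invokes the wrapped function once; editing the file triggers exactly one more call.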
The generated markdown includes:
- Overview - High-level project description
- Components - Classes and functions with summaries
- CocoIndex Pipeline - Mermaid diagrams (if CocoIndex is used)
- File Details - Per-file summaries (for multi-file projects)
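To make the section layout above concrete, here is a minimal sketch of a renderer that produces those four sections. The `CodebaseInfo` and `Component` fields shown here are illustrative assumptions, not the example's real schema, and `render_markdown` is a hypothetical helper rather than the function in main.py.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    name: str
    summary: str


@dataclass
class CodebaseInfo:
    # Field names are assumptions made for this sketch.
    name: str
    summary: str
    components: List[Component] = field(default_factory=list)
    mermaid_graphs: List[str] = field(default_factory=list)


def render_markdown(project: CodebaseInfo, files: List[CodebaseInfo]) -> str:
    fence = "`" * 3  # built programmatically to avoid nesting fences here
    lines = [f"# {project.name}", "", "## Overview", "", project.summary, ""]
    if project.components:
        lines += ["## Components", ""]
        lines += [f"- {c.name}: {c.summary}" for c in project.components]
        lines.append("")
    if project.mermaid_graphs:
        # Emitted only when the project actually uses CocoIndex.
        lines += ["## CocoIndex Pipeline", ""]
        for graph in project.mermaid_graphs:
            lines += [f"{fence}mermaid", graph, fence, ""]
    if len(files) > 1:
        # Per-file summaries are only useful for multi-file projects.
        lines += ["## File Details", ""]
        for info in files:
            lines += [f"### {info.name}", "", info.summary, ""]
    return "\n".join(lines)
```

The optional sections are skipped when their inputs are empty, which matches the "if CocoIndex is used" and "for multi-file projects" conditions in the list above.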
graph TD
%% App: SampleApp
app_main[<b>app_main</b>] ==> process_file[<b>process_file</b>]
process_file --> helper_func[helper_func]
Bold = @coco.fn, thick arrows (==>) = mount/use_mount calls
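In this example the LLM writes the Mermaid text itself, but the conventions are simple enough to express deterministically. The following sketch shows how the bold-node and thick-arrow rules map onto Mermaid syntax; `render_call_graph` and the `Edge` tuple are hypothetical names introduced for illustration only.

```python
from typing import Iterable, List, Set, Tuple

# (caller, callee, is_mount_call): hypothetical edge representation.
Edge = Tuple[str, str, bool]


def render_call_graph(app_name: str, coco_fns: Set[str], edges: Iterable[Edge]) -> str:
    """Render a Mermaid graph using the bold-node / thick-arrow conventions."""
    declared: Set[str] = set()

    def ref(name: str) -> str:
        # Attach the label only the first time a node appears.
        if name in declared:
            return name
        declared.add(name)
        label = f"<b>{name}</b>" if name in coco_fns else name
        return f"{name}[{label}]"

    lines: List[str] = ["graph TD", f"    %% App: {app_name}"]
    for caller, callee, is_mount in edges:
        arrow = "==>" if is_mount else "-->"
        lines.append(f"    {ref(caller)} {arrow} {ref(callee)}")
    return "\n".join(lines)
```

Calling it with `app_main` and `process_file` as the decorated functions, one mount edge, and one plain call edge reproduces the sample graph shown above.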
pip install -e .
Create a .env file in the example directory:
echo "GEMINI_API_KEY=your_api_key_here" > .env
Replace your_api_key_here with your actual Gemini API key.
Optionally, set a different LLM model:
echo "LLM_MODEL=gemini/gemini-2.5-flash" >> .env
Create a projects/ directory with subdirectories for each Python project:
projects/
├── my_project_1/
│ ├── main.py
│ └── utils.py
├── my_project_2/
│ └── app.py
└── ...
cocoindex update main.py
This will:
- Scan all subdirectories in projects/
- Extract information from all .py files (excluding .venv* directories)
- Generate markdown documentation in output/
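The scanning step can be approximated with the standard library alone. This is a sketch of the described behavior, not the code in main.py: `find_python_files` is a hypothetical helper, and the exact exclusion rule (any path component starting with `.venv`) is an assumption based on the description above.

```python
import pathlib
from typing import Dict, List


def find_python_files(root: pathlib.Path) -> Dict[str, List[pathlib.Path]]:
    """Map each project subdirectory name to its .py files, skipping .venv* dirs."""
    projects: Dict[str, List[pathlib.Path]] = {}
    for project_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files: List[pathlib.Path] = []
        for path in sorted(project_dir.rglob("*.py")):
            # Directory components between the project root and the file itself.
            parents = path.relative_to(project_dir).parts[:-1]
            if any(part.startswith(".venv") for part in parents):
                continue
            if path.is_file():
                files.append(path)
        projects[project_dir.name] = files
    return projects
```

For the directory layout shown earlier, `find_python_files(pathlib.Path("projects"))` would return one entry per project, each listing only source files outside virtual environments.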
ls -la output/
cat output/my_project_1.md
Edit the app definition in main.py:
app = coco.App(
    app_main,
    coco.AppConfig(name="MultiCodebaseSummarization"),
    root_dir=pathlib.Path("./your_projects_dir"),
    output_dir=pathlib.Path("./your_output_dir"),
)
Set the LLM_MODEL environment variable to any LiteLLM-supported model:
# OpenAI
export LLM_MODEL=gpt-4o
# Anthropic
export LLM_MODEL=anthropic/claude-3-5-sonnet
# Local (Ollama)
export LLM_MODEL=ollama/llama3.2
graph TD
%% App: MultiCodebaseSummarization
app_main[<b>app_main</b>] ==> process_project[<b>process_project</b>]
process_project ==> extract_file_info[<b>extract_file_info</b>]
process_project ==> aggregate_project_info[<b>aggregate_project_info</b>]
process_project --> generate_markdown[generate_markdown]
- app_main: Lists subdirectories, sets up output target, mounts process_project for each
- process_project: Extracts info from each file, aggregates, outputs markdown
- extract_file_info: Uses instructor + LLM to extract CodebaseInfo from each file
- aggregate_project_info: Combines file CodebaseInfo into project-level CodebaseInfo
- generate_markdown: Converts CodebaseInfo to markdown and calls declare_file
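The per-project flow described above (extract each file, aggregate, render) can be sketched in plain Python with the three steps injected as callables. This deliberately leaves out CocoIndex's mount and declare_file machinery; `summarize_project` and its parameters are hypothetical names, and the callables stand in for the LLM-backed functions.

```python
from typing import Callable, Dict, List, TypeVar

Info = TypeVar("Info")


def summarize_project(
    sources: Dict[str, str],
    extract: Callable[[str, str], Info],
    aggregate: Callable[[List[Info]], Info],
    render: Callable[[Info, List[Info]], str],
) -> str:
    """Run extract -> aggregate -> render for one project's files."""
    # Sort by path so the output does not depend on dictionary order.
    file_infos = [extract(path, text) for path, text in sorted(sources.items())]
    project_info = aggregate(file_infos)
    return render(project_info, file_infos)
```

Passing the steps in as arguments keeps the flow testable without an API key, since each callable can be replaced by a cheap stub.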