A versatile Python toolkit for analyzing multiple data sources (chat messages and documentation) using vector databases and comparative LLM evaluation.
This toolkit allows you to:
- Build vector databases from chat message data (CSV format)
- Add documentation (Markdown/MDX) to the vector database
- Query the vector database with specialized prompts
- Compare results from multiple LLMs (OpenAI, Anthropic, and Google Gemini)
- Generate comprehensive analyses and insights
Key features:

- Multiple Data Sources: Combine chat messages and documentation in a unified vector database
- Markdown to CSV Conversion: Convert chat data from markdown format to CSV for processing
- Comparative LLM Analysis: Query three leading LLMs simultaneously for diverse perspectives
- Specialized Prompts: Use domain-specific prompts for targeted insights
- Visualization: Generate side-by-side comparisons of LLM outputs
- Batch Processing: Run multiple analyses in sequence for comprehensive results
- Simplified Interface: Use the all-in-one script for streamlined workflow
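Conceptually, the vector-database workflow above embeds each chat message or documentation chunk and retrieves the entries most similar to a query. The toolkit presumably uses a real embedding model and vector store; the dependency-free sketch below is purely illustrative, using a toy bag-of-words "embedding" and cosine similarity to show the retrieval idea:

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy bag-of-words 'embedding': a word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, k=2):
    """Return the k corpus entries most similar to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

corpus = [
    "the node sync is stuck at 99 percent",
    "how do I reset my API key",
    "dashboard shows the node as offline",
]
print(retrieve("why is my node offline", corpus, k=1))
```

A production setup replaces `embed` with a learned embedding model and the sorted scan with an indexed nearest-neighbour search, but the query/retrieve shape is the same.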
```
ChatDoc-InsightMiner-PromptLab/
├── README.md
├── QUICKSTART.md
├── CONTRIBUTING.md                    # Contribution guidelines
├── CHANGELOG.md                       # Version history and changes
├── requirements.txt
├── .env.example                       # Template for API keys and settings
├── scripts/
│   ├── toolkit.py                     # All-in-one script for all functionality
│   ├── setup.py                       # Environment setup and verification
│   ├── build_vector_db.py             # Build vector DB from chat data
│   ├── add_docs_to_vector_db.py       # Add documentation to vector DB
│   ├── multi_llm_combined_analyzer.py # Query sources with multiple LLMs
│   ├── md_to_csv_converter.py         # Convert markdown files to CSV format
│   └── run_demo.py                    # Run a complete demo workflow
├── data/
│   └── chat_data.csv                  # Sample chat data
├── docs/                              # Documentation directory
│   └── monitoring-dashboard.md        # Sample documentation
├── prompts/                           # Prompt templates directory
│   ├── analysis_prompts/              # Analysis-focused prompts
│   │   ├── documentation_gaps.txt
│   │   ├── feature_demand_analysis.txt
│   │   ├── sentiment_analysis.txt
│   │   ├── technical_issues.txt
│   │   └── user_journey_analysis.txt
│   ├── creative_discovery_prompts/    # Creative analysis prompts
│   │   ├── community_dynamics.txt
│   │   ├── competitive_intelligence.txt
│   │   ├── future_prediction.txt
│   │   ├── knowledge_graph_extraction.txt
│   │   ├── linguistic_patterns.txt
│   │   ├── technical_issues_map.txt
│   │   └── user_persona_development.txt
│   ├── faq_prompts/                   # FAQ generation prompts
│   │   ├── blockfrost_icebreakers_faq_enhancement_prompt.txt
│   │   └── faq_creator.txt
│   └── prompt_for_specific_question/  # Targeted question prompts
│       └── identify_node.txt
├── tests/                             # Test suite
│   ├── run_tests.py                   # Test runner
│   ├── test_md_to_csv_converter.py    # Unit tests for conversion
│   └── test_toolkit_md2csv.py         # Integration tests for CLI
├── outputs/                           # Analysis outputs directory
├── vector_db/                         # Vector database storage
└── logs/                              # Log files directory
```
The toolkit comes with a variety of pre-built prompts for different analytical purposes:
General-purpose analytical prompts for extracting insights:
- Technical Issues Analysis: Identify common technical problems and solutions
- User Journey Analysis: Map the typical user experience and pain points
- Documentation Gaps: Find missing or incomplete documentation areas
- Feature Demand Analysis: Discover the most requested features
- Sentiment Analysis: Gauge user sentiment around specific topics
Advanced prompts for deeper insights and creative exploration:
- Knowledge Graph Extraction: Build a comprehensive domain knowledge map
- User Persona Development: Create detailed user personas from conversations
- Community Dynamics: Analyze community interactions and relationships
- Future Prediction: Forecast upcoming trends and potential issues
- Competitive Intelligence: Extract information about competitors
Tools for creating helpful documentation:
- FAQ Creator: Generate comprehensive FAQ documents from conversations
- FAQ Enhancement: Improve existing FAQs with new content
Targeted prompts for answering specific user questions:
- Identify Node: Step-by-step guide for identifying nodes in monitoring dashboards
- Python 3.9+
- API keys for OpenAI, Anthropic, and Google Gemini (for multi-LLM features)
- Basic knowledge of command-line usage
- Clone this repository:

  ```bash
  git clone https://github.com/joseph-fajen/ChatDoc-InsightMiner-PromptLab.git
  cd ChatDoc-InsightMiner-PromptLab
  ```

- Install requirements:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file from the template:

  ```bash
  cp .env.example .env
  ```

- Add your API keys and customize settings in the `.env` file:

  ```bash
  # Edit the .env file with your preferred text editor
  # Add at least one API key (OpenAI, Anthropic, or Google Gemini)
  ```
The easiest way to get started is to use the interactive wizard:
```bash
python scripts/toolkit.py wizard
```
The wizard will guide you through:
- Verifying your environment
- Setting up API keys
- Building the vector database
- Adding documentation
- Running your first analysis
Use the all-in-one script for a simplified workflow:

```bash
# Set up the environment
python scripts/toolkit.py setup

# Build the vector database
python scripts/toolkit.py build

# Add documentation to the vector database
python scripts/toolkit.py docs

# Run analysis with all LLMs
python scripts/toolkit.py analyze --prompt prompts/analysis_prompts/technical_issues.txt

# Run with a single LLM (if you don't have all API keys)
python scripts/toolkit.py fallback --prompt prompts/analysis_prompts/technical_issues.txt --provider openai

# Convert markdown file(s) to CSV
python scripts/toolkit.py md2csv --input sample-markdown-file.md --output data/output.csv

# Convert multiple markdown files with source tracking
python scripts/toolkit.py md2csv --input file1.md file2.md --output data/combined.csv --track-source

# Run batch analysis on all prompts in a directory
python scripts/toolkit.py analyze --batch --prompts-dir prompts/analysis_prompts

# Run the complete demo
python scripts/toolkit.py demo
```

You can also use the individual scripts:
- Prepare your chat data in CSV format with columns: timestamp, username, message
- Add documentation files in Markdown format to the docs directory
- Build the vector database:

  ```bash
  python scripts/build_vector_db.py
  ```

- Add documentation to the vector database:

  ```bash
  python scripts/add_docs_to_vector_db.py
  ```

- Run a multi-LLM analysis with a prompt:

  ```bash
  python scripts/multi_llm_combined_analyzer.py --prompt prompts/analysis_prompts/technical_issues.txt
  ```
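The md2csv conversion targets the same timestamp/username/message schema. The toolkit's actual parsing rules aren't shown here, so the sketch below assumes a hypothetical `[timestamp] username: message` line format purely to illustrate the markdown-to-CSV idea:

```python
import csv
import io
import re

# Hypothetical chat-line format assumed for illustration:
#   [2024-01-01 10:00] alice: message text
LINE = re.compile(r"\[(?P<timestamp>[^\]]+)\]\s*(?P<username>[^:]+):\s*(?P<message>.*)")

def md_chat_to_csv(markdown_text, out_file):
    """Write matching chat lines to CSV in the toolkit's expected schema."""
    writer = csv.writer(out_file)
    writer.writerow(["timestamp", "username", "message"])
    for line in markdown_text.splitlines():
        m = LINE.match(line.strip())
        if m:  # non-chat markdown lines are skipped
            writer.writerow([m["timestamp"], m["username"].strip(), m["message"]])

buf = io.StringIO()
md_chat_to_csv("[2024-01-01 10:00] alice: node is down\nnot a chat line", buf)
print(buf.getvalue())
```

For real conversions, use `scripts/md_to_csv_converter.py` (or `toolkit.py md2csv`), which handles the project's actual markdown layout.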
You can create your own custom prompts by adding text files to the appropriate directories:

- Create a new `.txt` file in one of the prompt directories
- Structure your prompt with clear instructions for the LLMs
- Use the `{conversations}` placeholder to reference retrieved content
- Run your analysis:

  ```bash
  python scripts/toolkit.py analyze --prompt path/to/your/prompt.txt
  ```
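At analysis time, the `{conversations}` placeholder is filled with the retrieved content. A minimal sketch of that substitution (the template text and retrieved snippets here are made up; the toolkit's actual retrieval call is not shown):

```python
# Hypothetical prompt template using the {conversations} placeholder.
template = (
    "You are analyzing community chat logs.\n"
    "Identify the three most common technical issues.\n\n"
    "Retrieved conversations:\n{conversations}\n"
)

# Snippets as they might come back from the vector database.
retrieved = [
    "alice: my node shows as offline in the dashboard",
    "bob: sync stalls at 99% after every restart",
]

# str.format substitutes the joined snippets into the placeholder.
prompt = template.format(conversations="\n".join(retrieved))
print(prompt)
```

Because substitution uses `str.format`, any literal braces in your prompt text would need to be doubled (`{{` and `}}`).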
The toolkit works best with all three LLM providers (OpenAI, Anthropic, and Google Gemini). However, you can still use it with just one provider by using the fallback mode:

```bash
python scripts/toolkit.py fallback --prompt prompts/your_prompt.txt --provider [openai|anthropic|gemini]
```

Choose the provider for which you have an API key. Each provider has different capabilities and limitations:

```bash
python scripts/toolkit.py fallback --prompt prompts/your_prompt.txt --provider openai
```

- Uses the model specified by `OPENAI_MODEL` in your `.env` file (defaults to gpt-4o)
- Good for general analysis and documentation generation

```bash
python scripts/toolkit.py fallback --prompt prompts/your_prompt.txt --provider anthropic
```

- Uses the model specified by `ANTHROPIC_MODEL` in your `.env` file (defaults to claude-3-opus-20240229)
- Excels at detailed analysis and complex reasoning

```bash
python scripts/toolkit.py fallback --prompt prompts/your_prompt.txt --provider gemini
```

- Uses the model specified by `GEMINI_MODEL` in your `.env` file (defaults to gemini-1.5-pro)
- Good for creative content and technical documentation
You will still need to set up the vector database with your chat data and documentation even when using fallback mode.
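The default-model behaviour described above is the usual environment-variable pattern: read the `.env` variable if set, otherwise fall back to the documented default. A sketch of that lookup (the `model_for` helper is illustrative, not the toolkit's actual API; the defaults mirror the values listed above):

```python
import os

# Documented fallback models; used when the .env variable is unset.
DEFAULTS = {
    "OPENAI_MODEL": "gpt-4o",
    "ANTHROPIC_MODEL": "claude-3-opus-20240229",
    "GEMINI_MODEL": "gemini-1.5-pro",
}

def model_for(var):
    """Return the configured model name, or the documented default."""
    return os.getenv(var, DEFAULTS[var])

print(model_for("OPENAI_MODEL"))
```

Tools like this typically load `.env` into the process environment at startup (e.g. with python-dotenv), after which `os.getenv` sees the configured values.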
If you encounter issues:

- Run the setup script to verify your environment:

  ```bash
  python scripts/setup.py
  ```

- Check the log files in the `logs/` directory for detailed error messages
- Ensure your API keys are correctly set in the `.env` file
- Verify that your input data (chat data and documentation) is in the correct format
The toolkit includes a test suite to validate its functionality:

```bash
# Run all tests
python tests/run_tests.py

# Run a specific test file
python tests/test_md_to_csv_converter.py

# Run a specific test case
python -m unittest tests.test_md_to_csv_converter.TestMarkdownToCsvConverter.test_single_file_conversion
```

The test suite includes:

- Unit tests for the markdown to CSV conversion logic
- Integration tests for the toolkit command line interface
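New tests follow the standard `unittest` pattern used by the suite. A minimal sketch of what an added unit test might look like (the `strip_handle` helper is hypothetical, not part of the toolkit's API):

```python
import unittest

def strip_handle(raw):
    """Hypothetical helper: normalize a chat handle like ' @Alice ' to 'alice'."""
    return raw.strip().lstrip("@").strip().lower()

class TestStripHandle(unittest.TestCase):
    def test_removes_prefix_and_whitespace(self):
        self.assertEqual(strip_handle(" @Alice "), "alice")

    def test_plain_name_unchanged(self):
        self.assertEqual(strip_handle("bob"), "bob")

# Run with: python -m unittest path.to.this_module
```

Dropping a new `test_*.py` file into `tests/` lets `run_tests.py` and `python -m unittest` discover it automatically, assuming the runner uses standard test discovery.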
This project is licensed under the Apache License 2.0 - see the LICENSE file for details. The Apache License 2.0 allows you to freely use, modify, distribute, and sublicense this code, while providing an express grant of patent rights from contributors to users.
Contributions are welcome! Please read the CONTRIBUTING.md file for guidelines on how to contribute to this project. We maintain a CHANGELOG.md to track versions and document changes.