This document details the current architecture and functionality of the RepoScanner application, including recent database persistence enhancements, and outlines a strategic path towards refactoring it into a more robust, flexible, and intelligent system using LangGraph and agentic LLM tools.
The RepoScanner application is designed to analyze GitHub repositories for compliance with the EU AI Act. Its core logic resides primarily within app/scanner.py, orchestrated by the scan_repo function, with recent enhancements for database persistence and API functionality.
A. Core Workflow (`app.scanner.scan_repo`)
The scan_repo function executes a sequential pipeline:
1. Input Processing:
   - Accepts a `RepoInputModel`, which can contain either a direct repository URL or structured details (owner, repo, branch).
   - Uses `resolve_repo_input` to ensure `owner`, `repo`, and `branch` are determined.
2. Repository Acquisition:
   - `get_repo_archive_info`: Fetches repository metadata, including the ZIP archive URL and commit SHA.
   - `download_repo_zip`: Downloads the repository ZIP archive into a temporary directory.
   - `unzip_archive`: Extracts the archive contents. A temporary directory (created via `tempfile.mkdtemp`) is used and cleaned up afterwards (`shutil.rmtree`).
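The `mkdtemp` / extract / `rmtree` pattern used in this step can be sketched as below. The helper name `extract_archive` is hypothetical, and an in-memory ZIP stands in for the bytes that `download_repo_zip` would fetch from GitHub.

```python
import io
import os
import shutil
import tempfile
import zipfile

def extract_archive(zip_bytes: bytes) -> tuple[str, list[str]]:
    """Unpack downloaded ZIP bytes into a fresh temp dir; caller must clean up."""
    temp_dir = tempfile.mkdtemp(prefix="reposcan_")
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        zf.extractall(temp_dir)
    names = sorted(
        os.path.relpath(os.path.join(root, f), temp_dir)
        for root, _dirs, files in os.walk(temp_dir) for f in files
    )
    return temp_dir, names

# Build a stand-in archive (in the real flow, download_repo_zip fetches it).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("repo-main/README.md", "# Widget\n")

temp_dir, names = extract_archive(buf.getvalue())
try:
    print(names)  # ['repo-main/README.md']
finally:
    shutil.rmtree(temp_dir, ignore_errors=True)  # cleanup, as in scan_repo
```

The `try`/`finally` mirrors the cleanup guarantee the scanner needs: the temporary directory is removed even if a later stage raises.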
3. File Discovery & Content Caching:
   - `find_documentation_files`: Scans the unzipped repository for Markdown (`*.md`) and OpenAPI (`*.yaml`, `*.json`) files.
   - `find_code_files`: Scans for Python (`*.py`), JavaScript (`*.js`), and TypeScript (`*.ts`) files.
   - Content Caching: File contents are read once using `open(..., errors='ignore')` and stored in a `file_content_cache` dictionary to avoid redundant reads and manage potential recursion issues.
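A minimal sketch of this step, assuming the glob patterns listed above; the real `find_documentation_files` / `find_code_files` may apply additional filtering (e.g., skipping vendored directories), and the function names below are illustrative.

```python
from pathlib import Path

DOC_PATTERNS = ("*.md", "*.yaml", "*.json")   # documentation / OpenAPI specs
CODE_PATTERNS = ("*.py", "*.js", "*.ts")      # code files

def discover_files(repo_root: str) -> tuple[list[Path], list[Path]]:
    """Recursively collect documentation and code files under repo_root."""
    root = Path(repo_root)
    docs = sorted(p for pat in DOC_PATTERNS for p in root.rglob(pat))
    code = sorted(p for pat in CODE_PATTERNS for p in root.rglob(pat))
    return docs, code

def build_content_cache(paths: list[Path]) -> dict[str, str]:
    # Read each file exactly once, tolerating undecodable bytes, so later
    # stages (summarization, AST analysis, embedding) never re-read from disk.
    return {str(p): p.read_text(encoding="utf-8", errors="ignore") for p in paths}
```

The cache keyed by path is what lets every downstream stage share one read per file.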
4. Documentation Analysis:
   - Markdown: For each Markdown file, `extract_text_and_headings_from_markdown` extracts its textual content and major headings.
   - OpenAPI: For each OpenAPI file, `parse_openapi_file` parses its content (using `yaml.safe_load`).
   - LLM Summarization: The collected text from Markdown and OpenAPI specs is passed to `llm_service.summarize_documentation` (an instance of `LLMService` from `app/services/llm_service.py`) to generate a concise summary of the repository's purpose, data, and users.
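The heading-extraction part of this step can be sketched with a simple regex over ATX headings; the real `extract_text_and_headings_from_markdown` may use a proper Markdown parser, so treat this as a behavioural illustration only.

```python
import re

def extract_text_and_headings(markdown: str) -> tuple[str, list[str]]:
    """Collect ATX headings ("# ...") and strip heading markers from the body."""
    headings = [m.group(2).strip()
                for m in re.finditer(r"^(#{1,6})\s+(.+)$", markdown, re.MULTILINE)]
    text = re.sub(r"^#{1,6}\s+", "", markdown, flags=re.MULTILINE)
    return text, headings
```

The resulting `(text, headings)` pair is what feeds the LLM summarization call alongside the parsed OpenAPI specs.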
5. Code Analysis:
   - Python: For each Python file, `analyze_python_code_ast` performs static analysis using the `ast` module to detect specific patterns (e.g., imports indicative of General Purpose AI usage).
   - JavaScript/TypeScript: `analyze_js_ts_code_ast` is a placeholder for future JS/TS AST analysis.
   - Grep-based Signals: `run_grep_search` executes `grep` commands on the codebase to find keywords related to biometric data, real-time processing, etc., contributing to `CodeSignal` flags.
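The AST-based import detection can be sketched as below. The hint list is hypothetical; the real `analyze_python_code_ast` defines its own criteria and likely detects more than imports.

```python
import ast

# Hypothetical signal list; the real analyzer defines its own criteria.
GPAI_IMPORT_HINTS = {"openai", "transformers", "anthropic", "langchain"}

def detect_gpai_imports(source: str) -> set[str]:
    """Walk the AST and report top-level packages that hint at GPAI usage."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]   # "transformers.pipelines" -> "transformers"
            if root in GPAI_IMPORT_HINTS:
                found.add(root)
    return found
```

Working on the AST rather than raw text avoids false positives from strings and comments, which is exactly why the scanner pairs this with the coarser grep-based signals rather than relying on grep alone.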
6. Vector Processing & Semantic Search (`app/vector_processing.py`):
   - Obligation Embedding: `upsert_obligation_documents` (called once, typically on application startup or when obligations change) processes the `eu_ai_act_obligations.yaml` file. It uses Langchain and an OpenAI embedding model (e.g., `text-embedding-ada-002`, configured via `app.config.Settings`) to create embeddings for each obligation and stores them in a ChromaDB collection (`CHROMA_PERSIST_PATH`, `CHROMA_COLLECTION_NAME`).
   - Repository Document Embedding: `upsert_repository_documents` takes `RepositoryFile` objects (containing path, content, and file type such as "doc", "code", or "openapi") for all relevant files from the scanned repository. It chunks these documents, generates embeddings, and upserts them into the same ChromaDB collection.
   - Fuzzy Matching: For each processed repository document (Markdown, OpenAPI, Python code), `find_matching_obligations_for_repo_doc` queries ChromaDB to find semantically similar EU AI Act obligations. This produces `FuzzyMatchResult` objects.
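Conceptually, the fuzzy-matching query ranks obligation embeddings by similarity to a repository-document embedding. The toy sketch below shows that ranking with plain cosine similarity over tiny hand-made vectors; in the application this is delegated to ChromaDB's nearest-neighbour query, and the function names here are illustrative.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_matches(doc_vec: list[float],
                obligation_vecs: dict[str, list[float]],
                k: int = 2) -> list[tuple[str, list[float]]]:
    """Rank obligations by similarity to a repo-document embedding."""
    return sorted(obligation_vecs.items(),
                  key=lambda kv: cosine(doc_vec, kv[1]), reverse=True)[:k]
```

Each returned (obligation-id, vector) pair corresponds, in the real pipeline, to a `FuzzyMatchResult` linking a repository document to an EU AI Act obligation.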
7. Risk Tier Determination:
   - `determine_risk_tier`: This function (currently a simplified placeholder) is intended to use the document summary, code signals, and potentially user inputs to assign a risk tier (prohibited, high, limited, minimal). It can fall back to `llm_service.classify_risk_with_llm` if deterministic rules are insufficient.
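The rules-first, LLM-fallback shape of this step could look like the sketch below. The signal names and tier rules are assumptions for illustration; the real `determine_risk_tier` is described as a placeholder and its rule set is not shown.

```python
from typing import Callable, Optional

def determine_risk_tier(signals: dict[str, bool],
                        llm_fallback: Optional[Callable[[dict], str]] = None) -> str:
    """Deterministic first pass over code signals, deferring to an LLM
    classifier (e.g. classify_risk_with_llm) only when no rule fires."""
    if signals.get("social_scoring"):          # hypothetical signal name
        return "prohibited"
    if signals.get("biometric_data") or signals.get("realtime_processing"):
        return "high"
    if signals.get("general_purpose_ai"):
        return "limited"
    if llm_fallback is not None:
        return llm_fallback(signals)
    return "minimal"
```

Keeping the deterministic rules first makes the common cases cheap and auditable, reserving the LLM call for genuinely ambiguous evidence.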
8. Checklist Generation:
   - `load_and_get_checklist_for_tier`: Based on the determined risk tier, this function loads the relevant obligations from `eu_ai_act_obligations.yaml` to form the compliance checklist.
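The tier-to-checklist selection can be sketched as a filter over the loaded obligations. In the application the data comes from `yaml.safe_load` on `eu_ai_act_obligations.yaml`; the field names and sample entries below are illustrative, not the file's actual schema.

```python
# Illustrative stand-in for the parsed YAML; real schema may differ.
OBLIGATIONS = [
    {"id": "art-9",  "title": "Risk management system", "tiers": ["high"]},
    {"id": "art-13", "title": "Transparency to users",  "tiers": ["high", "limited"]},
    {"id": "art-52", "title": "Disclosure of AI use",   "tiers": ["limited"]},
]

def checklist_for_tier(tier: str, obligations: list[dict] = OBLIGATIONS) -> list[dict]:
    """Select the obligations that apply to the determined risk tier."""
    return [o for o in obligations if tier in o["tiers"]]
```

Storing the tier applicability on each obligation keeps the YAML file the single criteria store: adding a new obligation never requires code changes.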
9. Output & Persistence:
   - The process culminates in a `ScanResultModel` (defined in `app/models.py`) containing the determined risk tier, the compliance checklist, document summary, code signals, and any fuzzy matches found.
   - The scan results are now persisted to the database by the `persist_scan_data_node` in the LangGraph flow, which saves the data as a `ScanRecord`.
   - The persisted data can be retrieved via new API endpoints for scan history and record retrieval.
B. Key Modules & Components
- `app/scanner.py`: Orchestrates the main scanning logic.
- `app/models.py`: Defines Pydantic models for input, output, and intermediate data structures (e.g., `RepoInputModel`, `ScanResultModel`, `RepositoryFile`, `FuzzyMatchResult`, `ScanRecordResponse`).
- `app/services/llm_service.py`: Handles all interactions with the OpenAI API (summarization, classification) using the `openai` library (v1.3.0+ for async).
- `app/vector_processing.py`: Manages embedding generation (Langchain, OpenAI embeddings) and semantic search against the ChromaDB vector store.
- `app/config.py`: Uses `pydantic-settings` to manage application configurations (API keys, model names, ChromaDB paths, database URLs) loaded from environment variables or a `.env` file.
- `app/db/`: Contains database-related modules:
  - `app/db/base_class.py`: Defines the SQLAlchemy declarative base class.
  - `app/db/session.py`: Manages database engine and session creation.
  - `app/db/models/scan_record.py`: Defines the SQLAlchemy model for scan records.
- `app/crud/`: Contains CRUD operations for database models:
  - `app/crud/crud_scan_record.py`: Implements create and retrieve operations for scan records.
- `app/graph_nodes.py`: Contains LangGraph nodes, including the new `persist_scan_data_node` for database persistence.
- `app/graph_orchestrator.py`: Orchestrates the LangGraph execution flow.
- `data/eu_ai_act_obligations.yaml`: Stores the structured EU AI Act obligations, serving as the criteria store.
- Testing (`tests/`): Utilizes `pytest` and `pytest-mock` for comprehensive unit testing, with detailed mocking of external services (GitHub, OpenAI) and file system operations.
C. Dependencies
- Core: Python 3.11+
- Web (for API, though scanner can run independently): FastAPI
- Data Validation & Settings: Pydantic, Pydantic-Settings
- LLM Interaction: `openai` (v1.3.0+ for async support)
- Vector Embeddings & Storage: `langchain`, `langchain-openai`, `langchain-chroma`, `chromadb`
- HTTP Client: `httpx` (for async GitHub API calls)
- YAML Processing: `PyYAML`
- File Handling: `zipfile`, `tempfile`, `shutil`
- Database ORM: `SQLAlchemy` (v2.0+ for async support)
- Database Drivers: `asyncpg`, `psycopg2-binary` (PostgreSQL), `aiosqlite` (SQLite for development)
- Graph Orchestration: `langgraph`
We've recently implemented significant improvements to the RepoScanner application:
A. Database Persistence
- Database Models: Created a `ScanRecord` SQLAlchemy model to store scan results in a relational database (PostgreSQL for production, SQLite for development).
- CRUD Operations: Implemented create and retrieve operations in `app/crud/crud_scan_record.py` for database interactions.
- Graph Integration: Added a new `persist_scan_data_node` to the LangGraph flow to save scan results to the database.
- State Management: Enhanced `ScanGraphState` with `db_session` and `persisted_record_id` fields to manage database operations within the graph execution.
B. API Enhancements
- New Endpoints: Implemented three new API endpoints for retrieving scan history:
  - `GET /api/v1/scan-records`: Lists all scan records with pagination and filtering by risk tier.
  - `GET /api/v1/scan-history`: Gets scan history for a specific repository using a query parameter.
  - `GET /api/v1/scan-records/{scan_id}`: Gets a specific scan record by ID.
- Response Models: Created a `ScanRecordResponse` Pydantic model for consistent API responses.
- Data Conversion: Added utilities to convert between SQLAlchemy models and Pydantic models.
The RepoScanner application has already begun its migration to LangGraph, with the implementation of database persistence nodes. We can continue to enhance it with more sophisticated LLM-powered "agentic tools" and additional features.
A. Vision
Transform the RepoScanner into a graph-based system where:
- Nodes represent distinct processing stages (e.g., "Fetch Code," "Analyze Python AST," "Generate Summary").
- Edges define the flow of data and control between nodes, potentially with conditional routing.
- State is a well-defined object that evolves as it passes through the graph.
- Agentic Tools are specialized functions (often LLM-powered) that nodes can invoke to perform complex sub-tasks, make decisions, or gather specific information.
B. Why LangGraph?
- Modularity & Maintainability: Each logical step becomes an independent node, making the system easier to understand, test, and modify.
- Flexibility & Control Flow: Complex workflows with conditional branching (e.g., "if high-risk signals found, invoke detailed security analysis node") become more manageable.
- State Management: LangGraph provides a robust way to manage and pass the application's state between components.
- Observability: Integration with tools like LangSmith for tracing and debugging complex chains and agent interactions.
- Resilience: Easier to implement retries and error handling for specific nodes.
- Agentic Capabilities: Provides a natural framework for building and orchestrating LLM agents that can use a suite of tools.
C. Proposed Refactoring & Enhancement Steps
1. Define the Core Graph State:
   - Create a comprehensive Pydantic model (e.g., `ScanGraphState`) to hold all data that needs to be passed between nodes. This would include:
     - Input parameters (`RepoInputModel`).
     - Paths (temp repo path, specific file paths).
     - Discovered file manifests.
     - Cached file contents.
     - Extracted text, headings, parsed OpenAPI data.
     - Code analysis results (AST, grep signals).
     - LLM summaries and classifications.
     - Embeddings status.
     - Fuzzy match results.
     - Determined risk tier and checklist.
     - Error/status flags for various stages.
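A runnable sketch of such a state is shown below as a `TypedDict` (a shape LangGraph commonly accepts for state schemas); the document proposes a Pydantic model, and all field names here are illustrative, not the real `ScanGraphState`.

```python
from typing import Optional, TypedDict

class ScanGraphState(TypedDict, total=False):
    """Illustrative state shape; the real ScanGraphState also carries
    fields like db_session and persisted_record_id."""
    repo_input: dict                  # serialized RepoInputModel
    temp_repo_path: Optional[str]     # set by the fetch/unzip stage
    doc_files: list[str]              # discovered file manifests
    code_files: list[str]
    file_content_cache: dict[str, str]
    doc_summary: Optional[str]        # LLM summarization output
    code_signals: dict[str, bool]     # AST / grep analysis results
    fuzzy_matches: list[dict]         # FuzzyMatchResult entries
    risk_tier: Optional[str]
    checklist: list[dict]
    errors: list[str]                 # per-stage error/status flags
```

`total=False` lets each node populate only the slice of the state it owns, which matches LangGraph's partial-update model.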
2. Identify and Implement Graph Nodes: Convert existing functional blocks from `scanner.py` and `vector_processing.py` into LangGraph nodes. Each node will be a function that accepts the current `ScanGraphState` and returns a dictionary updating parts of the state.
   - `InitialSetupNode`: Initializes state from `RepoInputModel`.
   - `FetchRepositoryNode`: Handles `get_repo_archive_info`, `download_repo_zip`, `unzip_archive`. Updates state with `temp_repo_path`.
   - `FileDiscoveryNode`: Runs `find_documentation_files`, `find_code_files`. Updates state with lists of file paths.
   - `ContentExtractionNode`: Iterates through discovered files, reads content (using the caching strategy), and stores it in the state.
   - `DocumentationProcessingNode`:
     - Processes Markdown files (`extract_text_and_headings_from_markdown`).
     - Processes OpenAPI files (`parse_openapi_file`).
     - Invokes `LLMService.summarize_documentation`. Updates state with summaries.
   - `CodeAnalysisNode`:
     - `PythonAstAnalysisSubNode`: Runs `analyze_python_code_ast`.
     - `GrepSignalSubNode`: Runs `run_grep_search`.
     - (Future) `JsTsAstAnalysisSubNode`. Updates state with `CodeSignal` results.
   - `RepositoryEmbeddingNode`: Calls `upsert_repository_documents` using content from the state.
   - `ObligationMatchingNode`: Calls `find_matching_obligations_for_repo_doc` for relevant processed documents. Updates state with the `FuzzyMatchResult` list.
   - `RiskAssessmentNode`: Implements `determine_risk_tier` logic, potentially calling an LLM agent/tool for complex cases. Updates state with `risk_tier`.
   - `ChecklistGenerationNode`: Calls `load_and_get_checklist_for_tier`. Updates state with the `checklist`.
   - `ReportCompilationNode`: Assembles the final `ScanResultModel` from the state.
   - `PersistScanDataNode`: ✅ Implemented. Persists scan results to the database using the `create_scan_record` function.
   - `CleanupNode`: Removes the temporary directory.
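The node contract described above (take the current state, return only the keys you update) can be sketched as follows; the tiny sequential runner stands in for LangGraph's executor and its merge semantics, and the rule inside the node is illustrative.

```python
def risk_assessment_node(state: dict) -> dict:
    """A LangGraph-style node: read from the state, return a partial
    update; the graph runtime merges it back into the state."""
    tier = "high" if state.get("code_signals", {}).get("biometric_data") else "minimal"
    return {"risk_tier": tier}

def run_pipeline(state: dict, nodes) -> dict:
    # Stand-in for LangGraph's executor: apply nodes in order, merging updates.
    for node in nodes:
        state = {**state, **node(state)}
    return state
```

Because nodes return partial updates rather than mutating shared state, each one stays independently testable with a plain dict, which is what makes the migration incremental.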
3. Develop Agentic Tools: Wrap specific functionalities, especially those involving LLM reasoning or complex data manipulation, into Langchain `Tool` objects. These tools can then be used by more sophisticated agentic nodes.
   - `CodeInspectorTool`:
     - Input: Code snippet and a list of inspection criteria (e.g., "check for insecure data handling," "identify PII usage").
     - Action: Uses an LLM (and potentially static analysis sub-tools) to analyze the code against the criteria.
     - Output: Analysis report with identified risks/patterns.
   - `DocumentationQueryTool`:
     - Input: Question about the repository's documentation (e.g., "What data sources does this system use?").
     - Action: Performs semantic search over the embedded documentation (ChromaDB) and uses an LLM to synthesize an answer.
     - Output: Answer string with supporting snippets.
   - `RiskClassificationAgentTool`:
     - Input: Compiled evidence (summaries, code signals, doc snippets).
     - Action: An LLM agent that reasons over the evidence to determine a risk tier and provide justification, potentially using other sub-tools to clarify ambiguities.
     - Output: Risk tier, confidence, justification.
   - `ObligationEvidenceFinderTool`:
     - Input: A specific EU AI Act obligation text.
     - Action: Uses semantic search (ChromaDB) and keyword search (grep) across the repository (code and docs) to find potential evidence of compliance or non-compliance. Could use an LLM to refine search queries or interpret results.
     - Output: List of relevant file snippets and paths.
4. Construct the LangGraph:
   - Define the graph by adding nodes and specifying edges (transitions) between them.
   - Implement conditional edges for dynamic routing. For example:
     - If `FileDiscoveryNode` finds no Python files, skip `PythonAstAnalysisSubNode`.
     - If `RiskAssessmentNode` confidently determines the risk deterministically, bypass LLM-based risk classification.
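The first routing example above can be sketched as a router function whose returned label selects the next node, analogous to LangGraph's conditional edges; the node names and dict-based dispatch are illustrative stand-ins for the real graph runtime.

```python
def route_after_discovery(state: dict) -> str:
    """Conditional-edge sketch: skip Python AST analysis when
    FileDiscoveryNode found no Python files."""
    return "python_ast_analysis" if state.get("code_files") else "grep_signals"

# Minimal dispatch table standing in for the graph runtime.
ROUTES = {
    "python_ast_analysis": lambda s: {**s, "ast_done": True},
    "grep_signals": lambda s: {**s, "grep_done": True},
}

state = {"code_files": []}  # discovery found no Python files
state = ROUTES[route_after_discovery(state)](state)
```

Because the router only inspects the state and returns a label, the branching decision stays declarative and easy to trace in tools like LangSmith.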
5. Integrate LLM Agents:
   - For nodes requiring complex decision-making (e.g., `RiskAssessmentNode` if rules are insufficient, or a new `ComplianceVerificationNode`), implement them as LLM agents (e.g., using Langchain's agent executors) equipped with the relevant tools defined in step 3.
   - The "EU-AI-Act-Inspector" agent concept (from Memory 09e8dd07-ea3a-4a0b-bb47-10b356b2da5e) can serve as an inspiration here. An agent could be tasked with verifying a set of obligations, using tools to gather evidence from the codebase and documentation.
6. Refine and Iterate:
   - Start by migrating a subset of the current pipeline into a simple LangGraph structure.
   - Incrementally add more nodes, tools, and agentic capabilities.
   - Focus on areas where LLM reasoning can provide the most value, such as interpreting ambiguous code or documentation in the context of legal obligations.
7. Human-in-the-Loop (HITL):
   - Design graph interruption points for "Unclear" statuses or low-confidence LLM decisions, allowing human experts to review and provide input before the process continues. LangGraph's state management can facilitate this.
D. Expected Benefits of This Evolution
- Enhanced Analytical Depth: Agents can perform more nuanced analysis by strategically using tools to investigate code and documentation.
- Improved Accuracy: LLM reasoning combined with targeted tools can lead to more accurate risk assessments and compliance checks.
- Greater Extensibility: Adding support for new programming languages, document types, or compliance checks becomes a matter of adding new tools or graph nodes.
- Better Error Handling & Resilience: Isolate failures within specific nodes and implement more granular retry logic.
- Increased Transparency: With LangSmith or similar tracing, the decision-making process of agents and the flow through the graph become more observable.
Based on our recent database persistence implementation, here are some immediate opportunities for further enhancement:
A. Authentication and Authorization
- Implement user authentication to protect the API endpoints.
- Add role-based access control for different types of users (e.g., administrators, analysts).
- Secure sensitive data and API keys.
B. Performance Optimization
- Implement caching for frequently accessed data (e.g., scan records, repository information).
- Optimize database queries with proper indexing and query optimization.
- Consider implementing a job queue for long-running scan operations.
C. Enhanced API Functionality
- Add more filtering options for scan records (e.g., by date range, repository owner).
- Implement search functionality for scan records.
- Add endpoints for statistical analysis of scan results (e.g., risk tier distribution, common compliance issues).
D. User Interface
- Develop a web-based dashboard for visualizing scan results and history.
- Implement interactive visualizations for risk assessment and compliance status.
- Add user-friendly forms for initiating new scans and viewing results.
E. Monitoring and Observability
- Implement comprehensive logging for all operations.
- Add metrics collection for API usage and performance monitoring.
- Set up alerts for critical errors or unusual patterns.
F. Testing and CI/CD
- Expand test coverage to include new database and API functionality.
- Implement integration tests for the complete scan workflow.
- Set up continuous integration and deployment pipelines.
This phased approach will allow for a gradual but powerful transformation of the RepoScanner, leveraging the strengths of LangGraph for orchestration and LLM agents for intelligent, tool-augmented analysis, while building on our recent database persistence enhancements.