Skip to content

Latest commit

 

History

History
251 lines (194 loc) · 19 KB

File metadata and controls

251 lines (194 loc) · 19 KB

RepoScanner: Current Implementation and Future Evolution with LangGraph

This document details the current architecture and functionality of the RepoScanner application, including recent database persistence enhancements, and outlines a strategic path towards refactoring it into a more robust, flexible, and intelligent system using LangGraph and agentic LLM tools.

1. Current Implementation

The RepoScanner application is designed to analyze GitHub repositories for compliance with the EU AI Act. Its core logic resides primarily within app/scanner.py, orchestrated by the scan_repo function, with recent enhancements for database persistence and API functionality.

A. Core Workflow (app.scanner.scan_repo)

The scan_repo function executes a sequential pipeline:

  1. Input Processing:

    • Accepts a RepoInputModel which can contain either a direct repository URL or structured details (owner, repo, branch).
    • Uses resolve_repo_input to ensure owner, repo, and branch are determined.
  2. Repository Acquisition:

    • get_repo_archive_info: Fetches repository metadata, including the ZIP archive URL and commit SHA.
    • download_repo_zip: Downloads the repository ZIP archive into a temporary directory.
    • unzip_archive: Extracts the archive contents. A temporary directory (created via tempfile.mkdtemp) is used and cleaned up afterwards (shutil.rmtree).
  3. File Discovery & Content Caching:

    • find_documentation_files: Scans the unzipped repository for Markdown (*.md) and OpenAPI (*.yaml, *.json) files.
    • find_code_files: Scans for Python (*.py), JavaScript (*.js), and TypeScript (*.ts) files.
    • Content Caching: File contents are read once using open(..., errors='ignore') and stored in a file_content_cache dictionary to avoid redundant reads and manage potential recursion issues.
  4. Documentation Analysis:

    • Markdown: For each Markdown file, extract_text_and_headings_from_markdown extracts its textual content and major headings.
    • OpenAPI: For each OpenAPI file, parse_openapi_file parses its content (using yaml.safe_load).
    • LLM Summarization: The collected text from Markdown and OpenAPI specs is passed to llm_service.summarize_documentation (an instance of LLMService from app.services.llm_service.py) to generate a concise summary of the repository's purpose, data, and users.
  5. Code Analysis:

    • Python: For each Python file, analyze_python_code_ast performs static analysis using the ast module to detect specific patterns (e.g., imports indicative of General Purpose AI usage).
    • JavaScript/TypeScript: analyze_js_ts_code_ast is a placeholder for future JS/TS AST analysis.
    • Grep-based Signals: run_grep_search executes grep commands on the codebase to find keywords related to biometric data, real-time processing, etc., contributing to CodeSignal flags.
  6. Vector Processing & Semantic Search (app.vector_processing.py):

    • Obligation Embedding: upsert_obligation_documents (called once, typically on application startup or when obligations change) processes the eu_ai_act_obligations.yaml file. It uses Langchain and an OpenAI embedding model (e.g., text-embedding-ada-002, configured via app.config.Settings) to create embeddings for each obligation and stores them in a ChromaDB collection (CHROMA_PERSIST_PATH, CHROMA_COLLECTION_NAME).
    • Repository Document Embedding: upsert_repository_documents takes RepositoryFile objects (containing path, content, and file type like "doc", "code", "openapi") for all relevant files from the scanned repository. It chunks these documents, generates embeddings, and upserts them into the same ChromaDB collection.
    • Fuzzy Matching: For each processed repository document (Markdown, OpenAPI, Python code), find_matching_obligations_for_repo_doc queries ChromaDB to find semantically similar EU AI Act obligations. This produces FuzzyMatchResult objects.
  7. Risk Tier Determination:

    • determine_risk_tier: This function (currently a simplified placeholder) is intended to use the document summary, code signals, and potentially user inputs to assign a risk tier (prohibited, high, limited, minimal). It can fall back to llm_service.classify_risk_with_llm if deterministic rules are insufficient.
  8. Checklist Generation:

    • load_and_get_checklist_for_tier: Based on the determined risk tier, this function loads the relevant obligations from eu_ai_act_obligations.yaml to form the compliance checklist.
  9. Output & Persistence:

    • The process culminates in a ScanResultModel (defined in app.models.py) containing the determined risk tier, the compliance checklist, document summary, code signals, and any fuzzy matches found.
    • The scan results are now persisted to a database using the persist_scan_data_node in the LangGraph flow, which saves the data as a ScanRecord in the database.
    • The persisted data can be retrieved via new API endpoints for scan history and record retrieval.

B. Key Modules & Components

  • app/scanner.py: Orchestrates the main scanning logic.
  • app/models.py: Defines Pydantic models for input, output, and intermediate data structures (e.g., RepoInputModel, ScanResultModel, RepositoryFile, FuzzyMatchResult, ScanRecordResponse).
  • app/services/llm_service.py: Handles all interactions with the OpenAI API (summarization, classification) using the openai library (v1.3.0+ for async).
  • app/vector_processing.py: Manages embedding generation (Langchain, OpenAI embeddings) and semantic search against the ChromaDB vector store.
  • app/config.py: Uses pydantic-settings to manage application configurations (API keys, model names, ChromaDB paths, database URLs) loaded from environment variables or a .env file.
  • app/db/: Contains database-related modules:
    • app/db/base_class.py: Defines the SQLAlchemy declarative base class.
    • app/db/session.py: Manages database engine and session creation.
    • app/db/models/scan_record.py: Defines the SQLAlchemy model for scan records.
  • app/crud/: Contains CRUD operations for database models:
    • app/crud/crud_scan_record.py: Implements create and retrieve operations for scan records.
  • app/graph_nodes.py: Contains LangGraph nodes, including the new persist_scan_data_node for database persistence.
  • app/graph_orchestrator.py: Orchestrates the LangGraph execution flow.
  • data/eu_ai_act_obligations.yaml: Stores the structured EU AI Act obligations, serving as the criteria store.
  • Testing (tests/): Utilizes pytest and pytest-mock for comprehensive unit testing, with detailed mocking of external services (GitHub, OpenAI) and file system operations.

C. Dependencies

  • Core: Python 3.11+
  • Web (for API, though scanner can run independently): FastAPI
  • Data Validation & Settings: Pydantic, Pydantic-Settings
  • LLM Interaction: openai (v1.3.0+ for async support)
  • Vector Embeddings & Storage: langchain, langchain-openai, langchain-chroma, chromadb
  • HTTP Client: httpx (for async GitHub API calls)
  • YAML Processing: PyYAML
  • File Handling: zipfile, tempfile, shutil
  • Database ORM: SQLAlchemy (v2.0+ for async support)
  • Database Drivers: asyncpg, psycopg2-binary (PostgreSQL), aiosqlite (SQLite for development)
  • Graph Orchestration: langgraph

2. Recent Enhancements: Database Persistence and API Endpoints

We've recently implemented significant improvements to the RepoScanner application:

A. Database Persistence

  • Database Models: Created a ScanRecord SQLAlchemy model to store scan results in a relational database (PostgreSQL for production, SQLite for development).
  • CRUD Operations: Implemented create and retrieve operations in app/crud/crud_scan_record.py for database interactions.
  • Graph Integration: Added a new persist_scan_data_node to the LangGraph flow to save scan results to the database.
  • State Management: Enhanced ScanGraphState with db_session and persisted_record_id fields to manage database operations within the graph execution.

B. API Enhancements

  • New Endpoints: Implemented three new API endpoints for retrieving scan history:
    • GET /api/v1/scan-records: Lists all scan records with pagination and filtering by risk tier.
    • GET /api/v1/scan-history: Gets scan history for a specific repository using a query parameter.
    • GET /api/v1/scan-records/{scan_id}: Gets a specific scan record by ID.
  • Response Models: Created a ScanRecordResponse Pydantic model for consistent API responses.
  • Data Conversion: Added utilities to convert between SQLAlchemy models and Pydantic models.

3. Next Steps: Evolving with LangGraph & Agentic Tools

The RepoScanner application has already begun its migration to LangGraph, with the implementation of database persistence nodes. We can continue to enhance it with more sophisticated LLM-powered "agentic tools" and additional features.

A. Vision

Transform the RepoScanner into a graph-based system where:

  • Nodes represent distinct processing stages (e.g., "Fetch Code," "Analyze Python AST," "Generate Summary").
  • Edges define the flow of data and control between nodes, potentially with conditional routing.
  • State is a well-defined object that evolves as it passes through the graph.
  • Agentic Tools are specialized functions (often LLM-powered) that nodes can invoke to perform complex sub-tasks, make decisions, or gather specific information.

B. Why LangGraph?

  • Modularity & Maintainability: Each logical step becomes an independent node, making the system easier to understand, test, and modify.
  • Flexibility & Control Flow: Complex workflows with conditional branching (e.g., "if high-risk signals found, invoke detailed security analysis node") become more manageable.
  • State Management: LangGraph provides a robust way to manage and pass the application's state between components.
  • Observability: Integration with tools like LangSmith for tracing and debugging complex chains and agent interactions.
  • Resilience: Easier to implement retries and error handling for specific nodes.
  • Agentic Capabilities: Provides a natural framework for building and orchestrating LLM agents that can use a suite of tools.

C. Proposed Refactoring & Enhancement Steps

  1. Define the Core Graph State:

    • Create a comprehensive Pydantic model (e.g., ScanGraphState) to hold all data that needs to be passed between nodes. This would include:
      • Input parameters (RepoInputModel).
      • Paths (temp repo path, specific file paths).
      • Discovered file manifests.
      • Cached file contents.
      • Extracted text, headings, parsed OpenAPI data.
      • Code analysis results (AST, grep signals).
      • LLM summaries and classifications.
      • Embeddings status.
      • Fuzzy match results.
      • Determined risk tier and checklist.
      • Error/status flags for various stages.
  2. Identify and Implement Graph Nodes: Convert existing functional blocks from scanner.py and vector_processing.py into LangGraph nodes. Each node will be a function that accepts the current ScanGraphState and returns a dictionary updating parts of the state.

    • InitialSetupNode: Initializes state from RepoInputModel.
    • FetchRepositoryNode: Handles get_repo_archive_info, download_repo_zip, unzip_archive. Updates state with temp_repo_path.
    • FileDiscoveryNode: Runs find_documentation_files, find_code_files. Updates state with lists of file paths.
    • ContentExtractionNode: Iterates through discovered files, reads content (using the caching strategy), and stores it in the state.
    • DocumentationProcessingNode:
      • Processes Markdown files (extract_text_and_headings_from_markdown).
      • Processes OpenAPI files (parse_openapi_file).
      • Invokes LLMService.summarize_documentation. Updates state with summaries.
    • CodeAnalysisNode:
      • PythonAstAnalysisSubNode: Runs analyze_python_code_ast.
      • GrepSignalSubNode: Runs run_grep_search.
      • (Future) JsTsAstAnalysisSubNode. Updates state with CodeSignal results.
    • RepositoryEmbeddingNode: Calls upsert_repository_documents using content from the state.
    • ObligationMatchingNode: Calls find_matching_obligations_for_repo_doc for relevant processed documents. Updates state with FuzzyMatchResult list.
    • RiskAssessmentNode: Implements determine_risk_tier logic, potentially calling an LLM agent/tool for complex cases. Updates state with risk_tier.
    • ChecklistGenerationNode: Calls load_and_get_checklist_for_tier. Updates state with the checklist.
    • ReportCompilationNode: Assembles the final ScanResultModel from the state.
    • PersistScanDataNode: ✅ Implemented. Persists scan results to the database using the create_scan_record function.
    • CleanupNode: Removes the temporary directory.
  3. Develop Agentic Tools: Wrap specific functionalities, especially those involving LLM reasoning or complex data manipulation, into Langchain Tool objects. These tools can then be used by more sophisticated agentic nodes.

    • CodeInspectorTool:
      • Input: Code snippet, list of inspection criteria (e.g., "check for insecure data handling," "identify PII usage").
      • Action: Uses LLM (and potentially static analysis sub-tools) to analyze the code against criteria.
      • Output: Analysis report, identified risks/patterns.
    • DocumentationQueryTool:
      • Input: Question about the repository's documentation (e.g., "What data sources does this system use?").
      • Action: Performs semantic search over embedded documentation (ChromaDB) and uses an LLM to synthesize an answer.
      • Output: Answer string with supporting snippets.
    • RiskClassificationAgentTool:
      • Input: Compiled evidence (summaries, code signals, doc snippets).
      • Action: An LLM agent that reasons over the evidence to determine a risk tier and provide justification, potentially using other sub-tools to clarify ambiguities.
      • Output: Risk tier, confidence, justification.
    • ObligationEvidenceFinderTool:
      • Input: A specific EU AI Act obligation text.
      • Action: Uses semantic search (ChromaDB) and keyword search (grep) across the repository (code & docs) to find potential evidence of compliance or non-compliance. Could use an LLM to refine search queries or interpret results.
      • Output: List of relevant file snippets and paths.
  4. Construct the LangGraph:

    • Define the graph by adding nodes and specifying edges (transitions) between them.
    • Implement conditional edges for dynamic routing. For example:
      • If FileDiscoveryNode finds no Python files, skip PythonAstAnalysisSubNode.
      • If RiskAssessmentNode confidently determines risk deterministically, bypass LLM-based risk classification.
  5. Integrate LLM Agents:

    • For nodes requiring complex decision-making (e.g., RiskAssessmentNode if rules are insufficient, or a new ComplianceVerificationNode), implement them as LLM agents (e.g., using Langchain's agent executors) equipped with the relevant tools defined in step 3.
    • The "EU-AI-Act-Inspector" agent concept (from Memory 09e8dd07-ea3a-4a0b-bb47-10b356b2da5e) can serve as an inspiration here. An agent could be tasked with verifying a set of obligations, using tools to gather evidence from the codebase and documentation.
  6. Refine and Iterate:

    • Start by migrating a subset of the current pipeline into a simple LangGraph structure.
    • Incrementally add more nodes, tools, and agentic capabilities.
    • Focus on areas where LLM reasoning can provide the most value, such as interpreting ambiguous code or documentation in the context of legal obligations.
  7. Human-in-the-Loop (HITL):

    • Design graph interruption points for "Unclear" statuses or low-confidence LLM decisions, allowing human experts to review and provide input before the process continues. LangGraph's state management can facilitate this.

D. Expected Benefits of This Evolution

  • Enhanced Analytical Depth: Agents can perform more nuanced analysis by strategically using tools to investigate code and documentation.
  • Improved Accuracy: LLM reasoning combined with targeted tools can lead to more accurate risk assessments and compliance checks.
  • Greater Extensibility: Adding support for new programming languages, document types, or compliance checks becomes a matter of adding new tools or graph nodes.
  • Better Error Handling & Resilience: Isolate failures within specific nodes and implement more granular retry logic.
  • Increased Transparency: With LangSmith or similar tracing, the decision-making process of agents and the flow through the graph become more observable.

4. Immediate Improvement Opportunities

Based on our recent database persistence implementation, here are some immediate opportunities for further enhancement:

A. Authentication and Authorization

  • Implement user authentication to protect the API endpoints.
  • Add role-based access control for different types of users (e.g., administrators, analysts).
  • Secure sensitive data and API keys.

B. Performance Optimization

  • Implement caching for frequently accessed data (e.g., scan records, repository information).
  • Optimize database queries with proper indexing and query optimization.
  • Consider implementing a job queue for long-running scan operations.

C. Enhanced API Functionality

  • Add more filtering options for scan records (e.g., by date range, repository owner).
  • Implement search functionality for scan records.
  • Add endpoints for statistical analysis of scan results (e.g., risk tier distribution, common compliance issues).

D. User Interface

  • Develop a web-based dashboard for visualizing scan results and history.
  • Implement interactive visualizations for risk assessment and compliance status.
  • Add user-friendly forms for initiating new scans and viewing results.

E. Monitoring and Observability

  • Implement comprehensive logging for all operations.
  • Add metrics collection for API usage and performance monitoring.
  • Set up alerts for critical errors or unusual patterns.

F. Testing and CI/CD

  • Expand test coverage to include new database and API functionality.
  • Implement integration tests for the complete scan workflow.
  • Set up continuous integration and deployment pipelines.

This phased approach will allow for a gradual but powerful transformation of the RepoScanner, leveraging the strengths of LangGraph for orchestration and LLM agents for intelligent, tool-augmented analysis, while building on our recent database persistence enhancements.