
devcolor/insight-alchemists-datathon


2025 Gates Foundation Datathon

Team: Insight Alchemists

An AI-powered Copilot chatbot that answers research questions about Postsecondary Data Partnership (PDP) data from the National Student Clearinghouse.

The Postsecondary Data Partnership (PDP) is a nationwide effort to help colleges and universities gain a fuller picture of student progress and outcomes, meet various reporting requirements, and identify where to focus their resources. Both institutions and educational organizations can participate in the PDP.

National Student Clearinghouse

This chatbot will help employees learn how to leverage PDP AR course, cohort, and financial aid data files in their work with students.

Copilot Agent Creation

Copilot Agent Interface

Quick Start

Get started with the AR data analysis agent by choosing one of the following options:

Option 1: Create M365 Copilot Agent (Recommended for Production)

This creates an AI-powered chatbot that can query the AR database through the Azure Function API.

Step 1: Generate Data Dictionaries (optional - metadata already included)

The data dictionaries provide context about the AR data structure. The metadata/ directory already contains generated data dictionaries, but you can regenerate them if needed:

cd data_fetching
make build && make dictionaries

This generates:

  • metadata/ar_data_dictionary.csv - Complete data dictionary for all AR fields
  • metadata/institutional_knowledge.csv - Context on data strengths, limitations, and common joins

These files will be used by the Copilot Agent to understand the data structure.
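As a quick sanity check, the dictionary CSVs can be loaded and indexed by field name. This is a minimal sketch: the column names (`field_name`, `description`) and sample rows below are illustrative assumptions, not the actual schema of metadata/ar_data_dictionary.csv.

```python
import csv
import io

def load_dictionary(csv_text: str) -> dict:
    """Index data-dictionary rows by field name for quick lookup."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["field_name"]: row["description"] for row in reader}

# Illustrative rows only; the real columns in
# metadata/ar_data_dictionary.csv may differ.
sample = (
    "field_name,description\n"
    "cohort,First term the student enrolled\n"
    "enrollment_type,How the student first enrolled\n"
)
fields = load_dictionary(sample)
```

In practice you would pass the contents of the real CSV (e.g. `open("metadata/ar_data_dictionary.csv").read()`) instead of the sample string.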

Step 2: Create Copilot Agent

  1. Navigate to M365 Copilot Studio
  2. Click "Create Agent" button in the left sidebar
  3. Choose a name for your agent (e.g., "AR Data Analyst")
  4. In the chatbox at the bottom left, paste one of the prompts from CoPilot Agent Prompts/:
    • PDP Question Seer.txt - For question understanding and discoverability
    • PDP Data Alchemist.txt - For detailed data analysis (takes longer to process)
    • Prompt Minion.txt - For optimizing and creating prompts
  5. Press Enter - Copilot will create the agent based on the prompt
  6. Test the agent in the preview panel on the right side
  7. Click "Create" button in the top-right to finalize

Available Prompts (in CoPilot Agent Prompts/):

  • PDP Question Seer.txt - Analyzes if questions can be answered with the dataset
  • PDP Data Alchemist.txt - Provides detailed data analysis responses
  • Prompt Minion.txt - Helps optimize agent prompts
  • Test Questions.txt - Sample questions for testing

Note: See CoPilot Agent Prompts/prompts_readme.md for more details on each prompt.

Step 3: Create and Deploy Azure Function

The Azure Function provides the API endpoint that the Copilot Agent will call to execute SQL queries.

# Navigate to the Azure Function directory
cd agent-sql-connector

# Create and activate virtual environment (if not already done)
python3.10 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Deploy the Azure Function to Azure
func azure functionapp publish datathon-query-2025-aa --build remote

# Note the endpoint URL: https://datathon-query-2025-aa.azurewebsites.net/api/query

Note: Ensure you have Azure Functions Core Tools v4 installed and are authenticated with Azure CLI.

Step 4: Connect Agent to Azure Function

Now that both the agent and function are created, connect them:

  1. In your Copilot agent, go to Actions or Connections settings
  2. Add an HTTP action/connection to call the Azure Function endpoint
  3. Configure the connection:
    • URL: https://datathon-query-2025-aa.azurewebsites.net/api/query
    • Method: POST
    • Headers: Content-Type: application/json
    • Body: JSON with SQL query: {"sql": "{{generated_sql_query}}"}

The agent should:

  1. Generate SQL queries based on user questions
  2. Call the Azure Function with the generated SQL
  3. Format and return the results to the user

Step 5: Test Your Agent

Try asking questions like:

  • "What is the GPA of students who take Psychology in their 2nd year?"
  • "How many students are retained by cohort?"
  • "What is the retention rate for first-generation students?"

The agent will generate SQL queries, call the Azure Function, and return formatted results.

Option 2: Query Database Directly (Recommended for Testing)

# 1. Start the database (automatic bootstrap)
cd agent-sql-connector
pip install -r requirements.txt
python db_query.py

This will automatically:

  • Detect if database is running
  • Start Docker containers if needed
  • Wait for initialization
  • Connect and run example queries

Option 3: Use Azure Function API Directly

# 1. Start database (from sql_integration)
cd sql_integration
docker compose up -d

# 2. Start Azure Function locally
cd ../agent-sql-connector
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
func start

# 3. Test the endpoint
curl -X POST http://localhost:7071/api/query \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT TOP 10 * FROM ar_courses"}'

SQL Connector Results

Architecture

This project has four main components:

1. Copilot Agent

  • Location: CoPilot Agent Prompts/
  • Purpose: LLM prompts and agent configurations for querying AR data
  • Prompts: PDP Question Seer, PDP Data Alchemist, Prompt Minion, and Test Questions

Note: These prompts are designed for use with M365 Copilot Studio. See CoPilot Agent Prompts/prompts_readme.md for setup instructions.

2. SQL Server Integration

  • Location: sql_integration/
  • Purpose: Database backend with AR course, cohort, and financial aid data
  • Bootstrap Script: agent-sql-connector/db_query.py - Automatically starts database and connects
  • Features:
    • Auto-detects if database is running
    • Starts Docker containers if needed
    • Query guardrails (blocks destructive operations)
    • Returns results as pandas DataFrames
    • Used by Azure Function handler

3. Azure Function Query Proxy

  • Location: agent-sql-connector/
  • Purpose: Processes database queries triggered by the Copilot Agent, handles HTTP POST logic, and executes safe SQL queries
  • Documentation: agent-sql-connector/README.md
  • Endpoint: POST /api/query (unauthenticated MVP)
  • Deployment: Azure Function App on Consumption Plan
  • Features:
    • Receives triggers from Copilot Agent and processes HTTP POST requests
    • Executes safe SQL SELECT queries via db_query.query() function
    • Environment variable configuration for database connections
    • Supports ngrok tunneling for local database testing
    • Returns JSON responses with query results

4. Data Foundation

  • Location: metadata/
  • Pulled and transformed by tooling under data_fetching/
  • Purpose: AR data dictionary and institutional context
  • Files:
    • ar_data_dictionary.csv - Complete data dictionary for all AR fields
    • institutional_knowledge.csv - Context on data strengths, limitations, and common joins

System Architecture Diagram


Copilot Agent Setup

Simple Usage

The Copilot Agent uses LLM prompts located in CoPilot Agent Prompts/ to interact with the AR data through the Azure Function Query Proxy.

Available Prompts

  • PDP Data Alchemist: Prompt for detailed data analysis responses (takes longer to process)

  • PDP Question Seer: Prompt for question understanding and discoverability

  • Test Questions: Sample questions for testing agent responses

Additionally:

  • Prompt Minion: Prompt for optimizing and creating prompts

See CoPilot Agent Prompts/prompts_readme.md for detailed usage instructions.

Integration with Azure Function

The Copilot Agent triggers the Azure Function, which then handles the database query logic:

  1. Copilot Agent → Invokes/triggers the Azure Function (via function invocation, not direct HTTP)
  2. Azure Function → Receives the trigger and handles the HTTP POST request processing
  3. Azure Function → Executes the SQL query via db_query.query() function
  4. Azure Function → Returns JSON results back to the Copilot Agent

The Azure Function exposes the endpoint at:

POST https://datathon-query-2025-aa.azurewebsites.net/api/query
Content-Type: application/json

{
  "sql": "SELECT TOP 10 * FROM ar_courses"
}

Configuration

  1. Configure the Copilot Agent to trigger the Azure Function endpoint
  2. Use the prompts in CoPilot Agent Prompts/ as templates for your LLM agent
  3. The Azure Function enforces query safety (SELECT-only queries) and returns results as JSON

See agent-sql-connector/README.md for Azure Function API documentation.


SQL Server Integration

Automatic Setup (Recommended)

# This one script does everything:
cd agent-sql-connector
python db_query.py

The script will:

  1. Detect if database is running
  2. Start Docker containers if needed
  3. Wait for initialization (60-90 seconds)
  4. Connect and run example queries

The script does not include a cleanup step, but you can reset the environment between runs as follows:

  1. cd sql_integration
  2. make clean-data

Manual Setup

cd sql_integration
./bootstrap.sh           # Install dependencies
docker compose up --build # Start database
python test_query.py     # Test connection

Database Tables

  • ar_courses - Course-level enrollment data
  • ar_cohorts - Student cohort tracking and outcomes
  • ar_financial_aid - Financial aid information
  • llm_debug_logs - Chatbot interaction logs
  • weekly_trends - Aggregated question analytics

Architecture Documentation

See sql_integration/ARCHITECTURE.md for detailed architecture documentation.


Azure Function Query Proxy

The Azure Function provides a REST API endpoint that allows the Copilot Agent to execute database queries remotely.

Quick Start

cd agent-sql-connector
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
func start  # Local development

Deployment

# Deploy to Azure
func azure functionapp publish datathon-query-2025-aa --build remote

Features

  • HTTP Endpoint: POST /api/query accepts SQL queries and returns JSON results
  • Query Safety: Enforces SELECT-only queries (no destructive operations)
  • Remote Access: Allows Copilot Agent to query database from anywhere
  • Environment Configuration: Database connection via Azure App Settings
  • Local Testing: Supports ngrok tunnel for testing against local database

Testing with Local Database

For development, you can use ngrok to expose your local SQL Server:

# 1. Start ngrok tunnel
ngrok tcp 1433

# 2. Configure Azure Function App settings
az functionapp config appsettings set \
  --name datathon-query-2025-aa \
  --resource-group datathon \
  --settings "DB_SERVER=<ngrok_hostname>,<port>" \
             "DB_DATABASE=InsightAlchemists" \
             "DB_USERNAME=SA" \
             "DB_PASSWORD=<password>"

Documentation

See agent-sql-connector/README.md for:

  • Complete setup instructions
  • System architecture diagram
  • Environment variable configuration
  • Testing and deployment guides
  • Security considerations

Data Dictionary Generation

The metadata/ directory already contains generated data dictionaries. To regenerate:

Quick Start

cd data_fetching
make build
make dictionaries

Output Files

Generated files are saved to metadata/:

  • ar_data_dictionary.csv - Auto-generated data dictionary from source files
  • institutional_knowledge.csv - Manually maintained institutional context

Detailed Documentation

See data_fetching/README.md for:

  • Prerequisites and dependencies
  • Available make commands
  • Data sources and format
  • Troubleshooting tips

SQL Cookbooks

We saved some quick SQL cookbook patterns to sql_cookbooks/README.md. These queries are a starting point for analyzing:

  • Retention metrics
  • Gateway completion impact on retention
  • Credit Momentum Analysis
  • Equity Gap Analysis across student populations
  • Course-Level Success (using course data), including delivery method impact
  • Program of Study Success Rates
  • Toxic vs Synergetic course loads

Project Structure

insight-alchemists-datathon/
├── CoPilot Agent Prompts/    # LLM prompts for Copilot
│   ├── PDP Data Alchemist.txt
│   ├── PDP Question Seer.txt
│   ├── Prompt Minion.txt
│   ├── Test Questions.txt
│   └── prompts_readme.md
├── sql_integration/          # SQL Server Docker setup
│   ├── docker-compose.yml   # Database container config
│   ├── Dockerfile
│   ├── setup.sql            # Database schema and data import
│   ├── bootstrap.sh         # Dependency installation
│   ├── entrypoint.sh        # Container startup script
│   ├── test_query.py        # Connection test script
│   ├── Makefile            # Convenience commands
│   ├── ARCHITECTURE.md      # Architecture documentation
│   └── README.md
├── agent-sql-connector/      # Azure Function Query Proxy
│   ├── fetch_proxy/          # Function handler code
│   │   ├── __init__.py
│   │   ├── function_app.py
│   │   └── handler.py
│   ├── query/                # Function configuration
│   │   └── function.json
│   ├── tests/               # Unit tests
│   │   ├── __init__.py
│   │   └── test_handler.py
│   ├── synthetic_data/      # Local test data
│   ├── db_query.py          # Shared query function (canonical source)
│   ├── requirements.txt     # Function dependencies
│   ├── host.json            # Azure Functions config
│   ├── local.settings.json  # Local settings (gitignored)
│   ├── meta.yaml            # Function App metadata
│   ├── MVP_SECURITY_NOTE.md # Security documentation
│   ├── project_plan.md      # Development plan
│   └── README.md            # Complete Azure Function documentation
├── metadata/                 # Data dictionaries
│   ├── ar_data_dictionary.csv
│   └── institutional_knowledge.csv
├── data_fetching/           # Dictionary generation tools
│   ├── docker-compose.yml
│   ├── Dockerfile
│   ├── prep_dictionaries.py
│   ├── Makefile
│   └── README.md
├── synthetic_data/          # Sample/test data (root level)
│   ├── ar_cohorts.csv
│   ├── ar_courses.csv
│   ├── ar_financial_aid.csv
│   └── README.md
└── sql_cookbooks/           # Example SQL query patterns
    └── README.md

Quick Reference

Common Commands

# Query database directly (using agent-sql-connector)
cd agent-sql-connector && python db_query.py

# Start database
cd sql_integration && docker compose up -d

# View database logs
cd sql_integration && docker compose logs

# Stop database
cd sql_integration && docker compose down

# Test database connection
cd sql_integration && python test_query.py

# Test Azure Function locally
cd agent-sql-connector && func start

# Deploy Azure Function
cd agent-sql-connector && func azure functionapp publish datathon-query-2025-aa --build remote

# Regenerate data dictionaries
cd data_fetching && make dictionaries

Requirements

  • Python 3.9+ for local development (the Azure Function requires Python 3.10)
  • Docker Desktop
  • ODBC Driver 18 for SQL Server (auto-installed by bootstrap.sh)
  • Azure Functions Core Tools v4 (for Azure Function development)

Support

For issues or questions:

  1. Check component-specific READMEs in subdirectories
  2. Review Docker logs: docker compose logs
  3. Verify Docker is running: docker ps
