
Conversation


@bhavishya-pohani bhavishya-pohani commented Jan 26, 2026

Summary

Adds FinQA environment for evaluating LLMs on financial question-answering tasks using SEC 10-K filing data.

  • Tool-calling based environment with SQL queries on financial tables
  • 290 questions across multiple companies (Alphabet, Amazon, Apple, etc.)
  • Fuzzy numerical matching for reward computation (handles percentages, fractions, LaTeX formatting)
  • Auto-generated OpenAI tool schemas from function docstrings (see the sketch after this list)
  • HuggingFace data download script
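
For the schema auto-generation, a minimal sketch of the docstring-introspection approach; all names here are hypothetical, and the PR's tool_schema.py may map types and docstrings differently:

```python
# Sketch: derive an OpenAI function-calling schema from a Python function.
import inspect
from typing import get_type_hints

_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def build_tool_schema(fn) -> dict:
    """Build a tool schema from a function's signature and docstring."""
    hints = get_type_hints(fn)
    props, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        if name == "self":
            continue
        props[name] = {"type": _JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default value => required argument
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {"type": "object", "properties": props, "required": required},
        },
    }

def sql_query(query: str) -> str:
    """Run a read-only SQL query against the filing's financial tables."""
    ...

print(build_tool_schema(sql_query)["function"]["name"])  # -> sql_query
```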

Features

  • Tools: get_descriptions, get_table_info, sql_query, submit_answer
  • Reward: Binary (1.0 correct, 0.0 incorrect) with 1% tolerance; an illustrative matcher is sketched after this list
  • Data: Downloaded from HuggingFace via download_data.sh
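
As an illustration of how such a fuzzy matcher can work, here is a minimal sketch; parse_number and compute_reward are hypothetical names, and the PR's rewards.py handles more formats (e.g. multiple \boxed{} values and an absolute-difference fallback):

```python
import re

def parse_number(text: str) -> float | None:
    """Extract a float from strings like '12.5%', '$1,234', or '\\boxed{0.125}'."""
    m = re.search(r"-?\d+\.?\d*", text.replace(",", ""))
    if m is None:
        return None
    value = float(m.group())
    if "%" in text:
        value /= 100.0  # normalize '12.5%' to 0.125 so percentages and fractions compare
    return value

def compute_reward(submitted: str, ground_truth: str, rel_tol: float = 0.01) -> float:
    """Binary reward: 1.0 if the numbers match within 1% relative tolerance."""
    a, b = parse_number(submitted), parse_number(ground_truth)
    if a is None or b is None:
        return 0.0
    return 1.0 if abs(a - b) <= rel_tol * max(abs(b), 1e-9) else 0.0
```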

Test Plan

  • Unit tests for reward matching (49 tests passing)
  • Docker build and inference script working

…ference script

  - Add /tools endpoint to expose tool schemas in OpenAI function calling format (a minimal sketch follows this list)
  - Auto-generate tool schemas from function docstrings (tool_schema.py)
  - Add download_data.sh to fetch data from HuggingFace
  - Fix reward computation for multi-value answers (multiple \boxed{} values)
  - Add comprehensive tests for reward matching
  - Remove unused imports, clean up dead code
  - Update README with download instructions
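
A minimal sketch of what such an endpoint can look like, assuming the build_tool_schema helper sketched earlier and a hypothetical tool registry; the PR's FastAPI wiring may differ:

```python
# Hypothetical /tools endpoint; build_tool_schema is the generator sketched above.
from fastapi import FastAPI

app = FastAPI()

def get_descriptions() -> str:
    """List the available financial tables and what each one contains."""
    ...

def submit_answer(answer: str) -> str:
    """Submit the final answer and end the episode."""
    ...

TOOLS = [get_descriptions, submit_answer]  # plus get_table_info, sql_query

@app.get("/tools")
def list_tools() -> list[dict]:
    """Expose tool schemas in OpenAI function-calling format."""
    return [build_tool_schema(fn) for fn in TOOLS]
```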

meta-cla bot commented Jan 26, 2026

Hi @bhavishya-pohani!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!


greptile-apps bot commented Jan 26, 2026

Greptile Overview

Greptile Summary

Adds FinQA environment for evaluating LLMs on financial question-answering using SEC 10-K filing data. The environment correctly follows OpenEnv architecture patterns, with client-server separation, reward computation in the server, and a tool-based interaction model.

Key Changes:

  • Tool-calling environment with 4 tools: get_descriptions, get_table_info, sql_query, submit_answer
  • Fuzzy numerical matching for rewards (handles percentages, fractions, LaTeX formatting with 1% tolerance)
  • Auto-generated OpenAI tool schemas from function docstrings via introspection
  • 290 questions across multiple companies from HuggingFace dataset
  • Comprehensive test suite (49 tests) covering various number formats and edge cases; a representative slice is sketched below
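
A hypothetical slice of such tests, written against the compute_reward sketch from the PR description above; the real suite's helpers and cases differ:

```python
import pytest
# reuses compute_reward from the fuzzy-matcher sketch earlier in this PR

@pytest.mark.parametrize(
    "submitted, truth, expected",
    [
        ("12.5%", "0.125", 1.0),       # percentage vs. fraction
        ("\\boxed{4.2}", "4.2", 1.0),  # LaTeX \boxed{} formatting
        ("101", "100", 1.0),           # inside the 1% relative tolerance
        ("110", "100", 0.0),           # outside the tolerance
    ],
)
def test_compute_reward(submitted, truth, expected):
    assert compute_reward(submitted, truth) == expected
```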

Architecture Alignment:

  • ✅ Rewards computed inside environment server (not client)
  • ✅ Client-server separation maintained (follows HTTPEnvClient pattern)
  • ✅ Environment inherits from core Environment interface
  • ✅ Action/Observation/State follow core type patterns
  • ✅ Docker-based deployment matching existing environments

Issues Found:

  • Critical bug in examples/finqa_inference.py:150: an undefined variable is referenced when the model returns no tool calls (a guard pattern is sketched after this list)
  • Minor style issue: the inference script accesses the private _base attribute
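
For context, a common shape of this class of bug and a guard that avoids it; this is illustrative only, not the actual code from finqa_inference.py:

```python
def pick_tool_call(response):
    """Return (name, arguments) for the first tool call, or None if there is none."""
    tool_calls = response.choices[0].message.tool_calls
    # Buggy shape: binding `call` only inside `for call in tool_calls: ...`
    # leaves it undefined when tool_calls is empty or None, so later code
    # that references `call` raises NameError in exactly the error path.
    call = None  # fix: initialize before branching
    if tool_calls:
        call = tool_calls[0]
    if call is None:
        return None  # caller treats the reply as plain text instead
    return call.function.name, call.function.arguments
```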

Confidence Score: 4/5

  • Safe to merge after fixing the critical undefined variable bug in the inference script
  • Score reflects a solid architecture following OpenEnv patterns with comprehensive testing, docked one point for the critical logic error in examples/finqa_inference.py, which would crash at runtime in the error-handling path. The core environment implementation is well designed, with proper client-server separation, reward computation in the server, extensive test coverage, and clean integration with existing patterns.
  • examples/finqa_inference.py requires immediate attention to fix the undefined-variable bug on line 150

Important Files Changed

| Filename | Overview |
| --- | --- |
| examples/finqa_inference.py | Adds an inference script; has an undefined-variable bug in the error-handling path and accesses a private attribute |
| src/envs/finqa_env/client.py | HTTP client implementation correctly following the HTTPEnvClient pattern |
| src/envs/finqa_env/server/finqa_environment.py | Environment implementation correctly inheriting from the core Environment, with reward computation in the server |
| src/envs/finqa_env/server/tools.py | Tool implementations with SQL query validation and lazy loading of table metadata |
| src/envs/finqa_env/server/rewards.py | Comprehensive reward matching with fuzzy numerical comparison and percentage/fraction handling |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Agent
    participant Client as FinQAEnv<br/>(HTTP Client)
    participant Server as FastAPI Server
    participant Env as FinQAEnvironment
    participant Tools as FinQATools
    participant Rewards as Reward System
    
    Agent->>Client: from_docker_image("finqa-env:latest")
    Client->>Server: Start Docker container
    Server-->>Client: base_url
    
    Agent->>Client: reset()
    Client->>Server: POST /reset
    Server->>Env: reset()
    Env->>Env: Load next question from shuffled dataset
    Env-->>Server: FinQAObservation(question, company, tools)
    Server-->>Client: JSON response
    Client-->>Agent: StepResult(observation, reward=None, done=False)
    
    loop Until answer submitted or max_steps
        Agent->>Client: step(FinQAAction(tool_name, tool_args))
        Client->>Server: POST /step {tool_name, tool_args}
        Server->>Env: step(action)
        
        alt Tool is get_descriptions/get_table_info/sql_query
            Env->>Tools: execute_tool(tool_name, tool_args)
            Tools->>Tools: Load data from JSON files
            Tools->>Tools: Execute SQL in-memory (sqlite3)
            Tools-->>Env: (result_string, is_final=False)
            Env-->>Server: FinQAObservation(tool_result, done=False)
        else Tool is submit_answer
            Env->>Tools: execute_tool("submit_answer", {answer})
            Tools-->>Env: (confirmation, is_final=True)
            Env->>Rewards: compute_reward(submitted, ground_truth)
            Rewards->>Rewards: Parse numbers (%, fractions, LaTeX)
            Rewards->>Rewards: Compare with 1% tolerance + 1.0 abs diff
            Rewards-->>Env: 1.0 (correct) or 0.0 (incorrect)
            Env-->>Server: FinQAObservation(result, done=True, reward)
        end
        
        Server-->>Client: JSON response
        Client-->>Agent: StepResult(observation, reward, done)
    end
    
    Agent->>Server: GET /tools
    Server-->>Agent: OpenAI tool schemas (auto-generated from docstrings)
```
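
Read end to end, the diagram corresponds to roughly this driver loop. This is a sketch: FinQAEnv, FinQAAction, from_docker_image, reset, and step appear in the diagram, while the import path and field names are assumptions:

```python
# Import path assumed from the file list above (src/envs/finqa_env/).
from envs.finqa_env import FinQAAction, FinQAEnv

env = FinQAEnv.from_docker_image("finqa-env:latest")
result = env.reset()
print(result.observation.question)

while not result.done:
    # A real agent would pick the tool from the model's function call;
    # here we immediately submit an answer to end the episode.
    action = FinQAAction(tool_name="submit_answer", tool_args={"answer": "42"})
    result = env.step(action)

print("reward:", result.reward)  # 1.0 if the fuzzy matcher accepts the answer
```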


@greptile-apps greptile-apps bot left a comment


2 files reviewed, 2 comments


meta-cla bot added the CLA Signed label Jan 26, 2026
…numbers, add fixes & tests for multiple numbers in labels