Skip to content

add test data generator for evaluation#43

Open
LikeHui92 wants to merge 18 commits into
mainfrom
feature/add_evaluation
Open

add test data generator for evaluation#43
LikeHui92 wants to merge 18 commits into
mainfrom
feature/add_evaluation

Conversation

@LikeHui92
Copy link
Copy Markdown

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@LikeHui92 LikeHui92 requested a review from lzongren May 15, 2026 01:14
Comment on lines +83 to +89
generated = generator.generate_test_cases(
teacher_samples=teacher_samples,
count=3,
power_instructions=power_instructions,
diversity_factor=0.8,
output_dir="./example_output/with_power"
)
Comment thread evaluation/test_data_generator/example.py Fixed
Comment thread evaluation/test_data_generator/intelligent_generator.py Fixed
Comment thread evaluation/test_data_generator/deduplicate_tests.py Fixed
Comment thread evaluation/test_data_generator/intelligent_generator.py Fixed
Comment thread evaluation/test_data_generator/intelligent_generator.py Fixed
Comment thread evaluation/test_data_generator/intelligent_generator.py Fixed
Comment thread evaluation/test_data_generator/domain_analyzer.py Fixed
Comment thread evaluation/test_data_generator/context_loader.py Fixed
Comment thread evaluation/test_data_generator/domain_analyzer.py Fixed
Copy link
Copy Markdown
Contributor

@lzongren lzongren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, pls address the code quality comments @LikeHui92. Not all of them are blocking but good for keeping a healthy repo from the start.

LikeHui92 and others added 9 commits May 14, 2026 23:05
… a call'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
…a call'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
…mplicit (fall through) returns'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
…once'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Comment thread evaluation/test_data_generator/intelligent_generator.py Fixed
Comment thread evaluation/test_data_generator/cli.py Fixed
LikeHui92 and others added 2 commits May 14, 2026 23:58
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
@LikeHui92
Copy link
Copy Markdown
Author

@lzongren Thanks for the review! Addressed the comments.

Copy link
Copy Markdown
Contributor

@lzongren lzongren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major Issues

1. No tests for the generator code

The test_data_generator/ package (~2400 lines of Python across 5 modules) has zero unit tests. ARCHITECTURE.md mentions a "Testing Strategy" with test_basic.py but the file doesn't exist. At minimum, structural analysis and validation logic can be tested without API calls.

2. example.py has broken imports

from intelligent_generator import IntelligentTestGenerator
from domain_analyzer import DomainAnalyzer

These bare imports won't work when running as a package (python -m test_data_generator.example). Should be relative imports: from .intelligent_generator import ...

3. Duplicated dead code in cli.py

cli.py contains a full load_source_context() function (~100 lines) that duplicates nearly all functionality of ContextLoader in context_loader.py. The function is never called since main() uses ContextLoader directly. This dead code should be removed.

…in cli.py and import issues in example.py, Remove generated_test_data from git tracking
Comment thread evaluation/test_data_generator/test_basic.py Fixed
Comment thread evaluation/test_data_generator/test_basic.py Fixed
Comment thread evaluation/test_data_generator/test_units.py Fixed
@LikeHui92
Copy link
Copy Markdown
Author

LikeHui92 commented May 16, 2026

Addressed review comments:

  • Fixed broken imports in example.py (relative imports)
  • Removed dead load_source_context() function in cli.py
  • Added one test sample which has no internal reference
  • Added comprehensive unit tests (22 tests)
  • Created evaluation README

@LikeHui92 LikeHui92 force-pushed the feature/add_evaluation branch from 768bd0c to ceb58e6 Compare May 16, 2026 17:14
Comment thread evaluation/test_data_generator/test_basic.py Fixed
Comment thread evaluation/test_data_generator/test_basic.py Fixed
Comment thread evaluation/test_data_generator/test_units.py Fixed
LikeHui92 and others added 2 commits May 16, 2026 10:20
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
@LikeHui92 LikeHui92 changed the title add test data generator and generated test samples for evaluation add test data generator for evaluation May 17, 2026
Resolved conflict in .gitignore by accepting main branch changes.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@LikeHui92
Copy link
Copy Markdown
Author

Hi @lzongren, please help review again, I've done with the requested changes are done, and resolved conflicts. Thanks!

Copy link
Copy Markdown
Contributor

@lzongren lzongren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up Review (after updates)

Great improvements — ATX references down from 223 to 1, generated test files removed, tests added, broken imports fixed. Almost there.

Remaining issues

1. No dependency manifest for evaluation/

There's no pyproject.toml or requirements.txt in the evaluation/ directory. The tool requires boto3 at runtime and the unit tests mock it, but users won't know what to install. At minimum a requirements.txt with boto3>=1.28 would help.

2. Directory name mismatch: test_sample/ vs test_samples/

The README directory tree shows test_sample/ (singular) but the actual file is at evaluation/test_samples/onboarding_intermediate.json (plural). One of them needs to be renamed to match.

3. Duplicated fields in onboarding_intermediate.json

simulated_human_guidance, timeout_seconds, and max_turns appear both at the top level AND inside the metadata object. This looks accidental — the metadata copy should be removed.

4. Emojis in log output

cli.py and test_basic.py use emojis in log messages (, , ⚠️). These render poorly in many terminal and CI environments. Consider plain text alternatives like PASS:, FAIL:, WARN:.

…fest, 2. fixed directory naming consistency, 3. removed duplicated JSON fields and so on
@LikeHui92
Copy link
Copy Markdown
Author

Hi @lzongren, thanks for the review! I've addressed the comments. Please help review again. Thank you!

)

# Analyze domain
analysis = analyzer.analyze_test_samples(teacher_samples, None)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source_context is str but here None is passed.

elif field in {"metadata"}:
test[field] = {}
elif field in {"assertions", "simulated_human_guidance"}:
test[field] = []
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong type, simulated_human_guidance is always used as a str but here given as a []

@@ -0,0 +1,155 @@
# Test Data Generator - Testing Guide
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having three readmes incl. (TEST_README.md, READEME.md and ARCHITECTURE.md) seems to be a bit too much IMO, would you mind to compact them into 1-2 docs?

@lzongren
Copy link
Copy Markdown
Contributor

Re: evaluation/ directory placement and packaging

Root-level evaluation/ is fine — that's the common pattern in the Python ecosystem (HF transformers has benchmark/, scikit-learn has benchmarks/, vLLM has benchmarks/). No objection there.

However, this repo has a stronger convention than most: every component uses pyproject.toml with a src/ layout. A bare requirements.txt is the only packaging outlier across the entire repo. Two options worth considering:

Option A — Minimal pyproject.toml at evaluation/ (keeps it installable, consistent):

[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"

[project]
name = "agent-builder-evaluation"
version = "0.0.0"
requires-python = ">=3.11"
dependencies = ["boto3>=1.28", "pytest>=7.0", "pytest-cov>=4.0"]

[tool.setuptools.packages.find]
where = ["."]

version = "0.0.0" is an established convention for workspace-internal packages that won't be published (see AutoGen's autogen-test-utils for precedent).

Option B — Keep requirements.txt, drop it in a follow-up. Lower friction now, but the inconsistency remains.

Either way, would be good to add a CI job for linting/testing this code so it doesn't silently rot. The unit tests in test_units.py don't need AWS credentials, so they could run cheaply.

@LikeHui92
Copy link
Copy Markdown
Author

LikeHui92 commented May 19, 2026

@lzongren Addressed the comments, thanks for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants