
[LFX Term 1 2026] Restoring Ianvs LLM-Agent setup and usage #407

Open
NishantSinghhhhh wants to merge 1 commit into kubeedge:main from NishantSinghhhhh:restoration-llm-agent

Conversation

@NishantSinghhhhh (Contributor)

feat: add requirements.txt for dependencies
fix: refactor basemodel.py for improved readability and functionality
refactor: enhance rouge.py to utilize RougeScorer for metric calculations

What type of PR is this?

/kind feature
/kind cleanup

What this PR does / why we need it:

This PR fixes and refactors the llm-agent singletask learning benchmark to make it fully functional end-to-end. The original example code had several issues that prevented it from running: broken relative paths, a missing dataset, deprecated HuggingFace API arguments, a name collision with the Ianvs framework lifecycle hook, and a broken ROUGE metric script.

Changes included:

  • requirements.txt: Added a requirements.txt listing all dependencies needed to run the LLM-agent benchmark (torch, transformers, peft, datasets, evaluate, rouge_score), which were previously undocumented and missing from the environment.
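
    The dependency list can be captured in a requirements.txt along these lines (package names taken from the bullet above; no versions are pinned here, since the PR does not state any):

    ```
    torch
    transformers
    peft
    datasets
    evaluate
    rouge_score
    ```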

  • basemodel.py:

    • Replaced deprecated use_auth_token= argument with token= to match current HuggingFace transformers API
    • Added empty preprocess(self, **kwargs) lifecycle hook required by the Ianvs singletask learning framework
    • Renamed the internal helper preprocess() to _preprocess_sample() to avoid collision with the framework hook
    • Updated _preprocess_sample() signature to accept plain strings instead of a samples object
    • Flattened the return value of _preprocess_sample() (removed erroneous [None] wrapper)
    • Added explicit str() cast in train() loop when iterating train_data.x / train_data.y to handle numpy.str_ types that caused tokenizer failures
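
  The hook, rename, and casting changes above can be sketched as follows. This is a minimal illustration, not the PR's actual basemodel.py: the class shape, the _preprocess_sample parameters, and the prompt format are assumptions for demonstration only.

  ```python
  class BaseModel:
      def preprocess(self, **kwargs):
          """Empty lifecycle hook called by the Ianvs singletask learning
          framework. It must exist even though it does nothing; a helper
          with the same name previously shadowed it, hence the rename."""

      def _preprocess_sample(self, question: str, answer: str) -> str:
          # Accepts plain strings (not a samples object) and returns a
          # flat string, with no [None] wrapper around the result.
          return f"Question: {question}\nAnswer: {answer}"

      def train(self, train_data):
          prompts = []
          for x, y in zip(train_data.x, train_data.y):
              # Explicit str() casts: dataset values may arrive as
              # numpy.str_, which reportedly caused tokenizer failures.
              prompts.append(self._preprocess_sample(str(x), str(y)))
          return prompts
  ```
  
  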
  • rouge.py:

    • Removed bare EOF token at end of file (invalid Python causing NameError on import)
    • Replaced evaluate.load() (which required a local metrics folder that did not exist) with direct rouge_score.rouge_scorer.RougeScorer calls
    • Updated y_pred handling to use str() cast instead of ["generated_text"] dict access, matching the plain-string output of basemodel.predict()
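
A metric along the lines described for rouge.py can be sketched like this. The function name, the rougeL choice, and the calculate_mean helper are illustrative assumptions (the helper mirrors the division-by-zero guard suggested in review); the RougeScorer call follows the documented rouge_score API.

```python
def calculate_mean(values):
    # Guard against division by zero when the score list is empty.
    return sum(values) / len(values) if values else 0.0

def rouge_l_score(y_true, y_pred):
    # Direct RougeScorer use in place of evaluate.load(); the import is
    # kept local so the metric file fails lazily if rouge_score is absent.
    from rouge_score import rouge_scorer
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [
        # str() casts match the plain-string output of basemodel.predict()
        scorer.score(str(ref), str(pred))["rougeL"].fmeasure
        for ref, pred in zip(y_true, y_pred)
    ]
    return calculate_mean(scores)
```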

Which issue(s) this PR fixes:

Fixes #


Signed-off-by: NishantSinghhhhh <nishantsingh_230137@aitpune.edu.in>
@kubeedge-bot kubeedge-bot added kind/feature Categorizes issue or PR as related to a new feature. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Apr 23, 2026
@kubeedge-bot (Collaborator)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: NishantSinghhhhh
To complete the pull request process, please assign jaypume after the PR has been reviewed.
You can assign the PR to them by writing /assign @jaypume in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubeedge-bot kubeedge-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 23, 2026
@NishantSinghhhhh (Contributor, Author)

Screencast.from.2026-04-23.13-42-27.webm

@MooreZheng sir,

After making all these changes, I was able to restore the LLM-Agent benchmark and run it successfully.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request significantly updates the Ianvs LLM-Agent benchmark by providing a comprehensive reproduction guide, adding a requirements file, and refactoring the core model and evaluation logic. Key changes include a rewritten predict method that correctly slices prompt tokens from the output and an updated ROUGE scoring implementation using the rouge_score library. Review feedback focuses on ensuring input tensors are moved to the correct device, removing redundant imports, adopting idiomatic boolean checks, and utilizing the internal calculate_mean function to prevent potential division-by-zero errors in metric calculations.

Three comment threads on examples/llm-agent/singletask_learning_bench/testenv/rouge.py
@NishantSinghhhhh NishantSinghhhhh changed the title [lFX Term 1 2026 ] Restoring Ianvs LLM-Agent Benchmark setup and usage [lFX Term 1 2026 ] Restoring Ianvs LLM-Agent setup and usage Apr 23, 2026
