Conversation

@ZIYU-DEEP
Contributor

Description

Describe your changes in detail (optional if the linked issue already contains a detailed description of the changes).

Fixes #1737. Changes made in:

  • ./examples/datagen/evol_instruct
  • ./camel/datagen/evol_instruct

Checklist

Go over all the following points, and put an x in all the boxes that apply.

  • I have read the CONTRIBUTION guide (required)
  • I have linked this PR to an issue using the Development section on the right sidebar or by adding Fixes #issue-number in the PR description (required)
  • I have checked if any dependencies need to be added or updated in pyproject.toml and poetry.lock
  • I have updated the tests accordingly (required for a bug fix or a new feature)
  • I have updated the documentation if needed:
  • I have added examples if this is a new feature

Notes for Reviewers

The data handling in EvolInstruct and SelfInstruct currently differs and could be improved. Shall we discuss how to align them through a shared base class?


    # simulate random scores in range (1, 10) for now
    scores = [random.randint(1, 10) for _ in batch_results[1:]] if keep_original else [random.randint(1, 10) for _ in batch_results]
else:
    # TODO: implement instruction scoring module, e.g., complexity/quality scorer or by reward advantage
Contributor Author

left the scorer, which evaluates instructions, as a future feature; it can be rule-based or backed by a generative agent. some references:

IN_BREADTH_KEYS = ['persona', 'shift-in', 'shift-out', 'mix', 'abstract']
IN_DEPTH_KEYS = ['constraints', 'deepening', 'concretizing', 'reasoning', 'expansion']

EVOL_METHODS = {
Contributor Author

notes: we can define more domain-specific templates (e.g., for math/coding/...).

also, currently the evolving happens independently for each prompt (x' ~ LLM( | x, ins)); we should improve this later so that the evolving becomes multi-prompt / group based (x' ~ LLM( | a cluster of x, ins)), where the LLM can crossover and mutate in a group.

regarding the prompt groups -- some time ago, @lightaime mentioned message-passing based sampling. we can also include support for this in our pipeline.
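
To make the contrast in the note above concrete, here is a minimal sketch of per-prompt versus group-based evolving. The `LLM` callable, both function names, and the prompt wording are hypothetical stand-ins for illustration; they are not part of this PR or of the camel API.

```python
from typing import Callable, List

# Hypothetical stand-in for a model call: takes a prompt string, returns text.
LLM = Callable[[str], str]


def evolve_independently(prompts: List[str], instruction: str, llm: LLM) -> List[str]:
    """x' ~ LLM(. | x, ins): each prompt is rewritten on its own."""
    return [llm(f"{instruction}\n\nPrompt:\n{x}") for x in prompts]


def evolve_in_group(prompts: List[str], instruction: str, llm: LLM) -> str:
    """x' ~ LLM(. | cluster of x, ins): a single call sees the whole group."""
    cluster = "\n".join(f"{i + 1}. {x}" for i, x in enumerate(prompts))
    return llm(
        f"{instruction}\n\nCombine or mutate ideas from these prompts:\n{cluster}"
    )
```

In the group variant the model receives every prompt in the cluster at once, which is what would let it cross over and mutate between them.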

Collaborator

@zjrwtx zjrwtx left a comment

Many thanks for your work, @ZIYU-DEEP, but some docstrings need to be polished.

    self,
    agent: ChatAgent,
):
    """
Collaborator

Suggested change
- """
+ r"""

@Wendong-Fan Wendong-Fan added this to the Sprint 25 milestone Mar 10, 2025
Collaborator

@zjrwtx zjrwtx left a comment

great thanks for your work @ZIYU-DEEP

@Wendong-Fan Wendong-Fan changed the title from "[feat] add EvolInstruct alike methods to camel/datagen" to "feat: add EvolInstruct alike methods to camel/datagen" Mar 11, 2025
@Zhangzeyu97
Collaborator

Ziyu @ZIYU-DEEP and I had a discussion about the current evol-instruct and identified the following areas for improvement:

  1. Support for user-defined EvolInstructTemplates: The current templates are designed for general instructions. However, applying evol-instruct to a specific domain requires corresponding modifications to the meta prompt and the evol methods.

  2. Provide a template example based on the advanced math domain.

  3. Implement an LLM-based scorer to evaluate aspects such as complexity and diversity.

We will collaborate to improve these aspects. If you have any ideas, feel free to discuss them with us!
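
As a rough illustration of point 3, a scorer could sit behind a small interface so that rule-based and LLM-based implementations are interchangeable. Everything in this sketch (`InstructionScorer`, `LengthScorer`, `best_prompt`) is a hypothetical name chosen for illustration, and the word-count rule is only a toy stand-in for a real complexity or diversity measure.

```python
from typing import List, Protocol


class InstructionScorer(Protocol):
    """Anything with a score(prompt) -> float method counts as a scorer."""

    def score(self, prompt: str) -> float: ...


class LengthScorer:
    """Toy rule-based scorer: more words is treated as more complex."""

    def score(self, prompt: str) -> float:
        return float(len(prompt.split()))


def best_prompt(prompts: List[str], scorer: InstructionScorer) -> str:
    """Return the prompt with the highest score."""
    return max(prompts, key=scorer.score)
```

An LLM-based scorer would implement the same `score` method by prompting a model and parsing its rating; that part is left out here because it depends on the agent setup.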

Member

@Wendong-Fan Wendong-Fan left a comment

thanks @ZIYU-DEEP for the contribution, and sorry for the late review. left some comments below. we also need to add unit tests for this feature; please run pre-commit run --all-files locally in your terminal to check the existing pre-commit errors~


def _set_method(
    self,
    method: Optional[Union[str, List[str]]] = "uniform",
Member

a plain str may be too general here. how about using Literal?

method: Optional[Union[Literal["uniform", "in-breadth", "in-depth",....]
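
A minimal sketch of what the Literal suggestion could look like. The alias `EvolMethod`, the helper `normalize_method`, and the three names listed are assumptions for illustration; the full set of accepted method names in the PR is elided above and may be larger.

```python
from typing import List, Literal, Optional, Union, get_args

# Hypothetical alias; only the three names quoted in the review are listed.
EvolMethod = Literal["uniform", "in-breadth", "in-depth"]
VALID_METHODS = get_args(EvolMethod)


def normalize_method(
    method: Optional[Union[EvolMethod, List[EvolMethod]]] = "uniform",
) -> List[str]:
    """Return the requested methods as a validated list of names."""
    if method is None:
        method = "uniform"
    methods = [method] if isinstance(method, str) else list(method)
    for name in methods:
        # Literal is only enforced by static type checkers, so check at runtime too.
        if name not in VALID_METHODS:
            raise ValueError(
                f"Unknown method {name!r}; expected one of {VALID_METHODS}"
            )
    return methods
```

With this annotation a type checker flags a misspelled method name at the call site, while the runtime check covers callers that are not type-checked.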

def _generate_single(
    self,
    prompt: str,  # for a single prompt
    method: str = "uniform",
Member

use Literal?

Comment on lines 273 to 279
else:
    # TODO: implement instruction scoring module, e.g., complexity/quality scorer or by reward advantage
    raise NotImplementedError(f"Scorer '{scorer}' is not implemented.")

# select the prompt with the highest score
best_index = scores.index(max(scores))
current_prompt = batch_results[best_index + 1][0] if keep_original else batch_results[best_index][0]
Member

it seems that if best_index is the last index in scores (e.g., the last generated prompt has the highest score), then best_index + 1 would point past the end of batch_results?

Contributor Author

keep_original will add one more prompt to the list prior to this.
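
The reply can be checked with a small worked example. The sketch below mirrors the quoted selection logic; the `(prompt, method)` tuple layout and the sample values are assumptions, since only `batch_results[i][0]` being the prompt is visible in the diff.

```python
from typing import List, Tuple


def pick_best_prompt(
    batch_results: List[Tuple[str, str]],
    scores: List[int],
    keep_original: bool,
) -> str:
    """Mirror the selection logic quoted in the review comment."""
    best_index = scores.index(max(scores))
    # With keep_original, scores cover batch_results[1:], so shift by one.
    offset = 1 if keep_original else 0
    return batch_results[best_index + offset][0]


# Hypothetical (prompt, method) pairs; index 0 holds the original prompt.
with_original = [
    ("original", "none"),
    ("evolved A", "in-depth"),
    ("evolved B", "in-breadth"),
]

# keep_original=True: two scores for three entries, and the last score wins.
assert pick_best_prompt(with_original, [3, 9], keep_original=True) == "evolved B"

# keep_original=False: one score per entry, no shift needed.
evolved_only = with_original[1:]
assert pick_best_prompt(evolved_only, [3, 9], keep_original=False) == "evolved B"
```

Because scores is built from batch_results[1:] when keep_original is set, the largest possible best_index is len(batch_results) - 2, so best_index + 1 stays within bounds.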

Member

could we move the .ipynb under docs/cookbooks/data_generation?

@zjrwtx zjrwtx self-requested a review March 23, 2025 06:29
@ZIYU-DEEP ZIYU-DEEP marked this pull request as draft March 24, 2025 15:37
@ZIYU-DEEP
Contributor Author

Thanks a lot @Wendong-Fan! just resolved some minor issues. converting this PR to a draft and handing over to @Zhangzeyu97 to work on the feat/Intergrate-Evol-Instruct branch under the camel repo with new features!

Development

Successfully merging this pull request may close these issues.

[Feature Request] Add EvolInstruct methods to camel/datagen

4 participants