
Add codegen_pipeline, preference_pipeline, text2sql_pipeline, tool_use_pipeline #1049

Open
ChongjinHuang wants to merge 9 commits into LazyAGI:dev/pipeline from ChongjinHuang:hcj/ppl_new

Conversation


@ChongjinHuang (Contributor) commented Mar 6, 2026

📌 PR Description

  • Add codegen_pipeline, preference_pipeline, text2sql_pipeline, tool_use_pipeline and their corresponding tests

✅ Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Refactor (no functionality change, code structure optimized)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update (changes to docs only)
  • Performance optimization

🧪 How Has This Been Tested?

pytest -vs tests/data_tests/Pipeline/test_codegen_ppl.py
pytest -vs tests/data_tests/Pipeline/test_preference_ppl.py
pytest -vs tests/data_tests/Pipeline/test_text2sql_ppl.py
pytest -vs tests/data_tests/Pipeline/test_tool_use_ppl.py

⚠️ Additional Notes

@ChongjinHuang ChongjinHuang requested review from a team as code owners March 6, 2026 07:46
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the data processing capabilities of the lazyllm framework by introducing a suite of specialized pipelines. These additions enable the generation of structured and high-quality training data for advanced LLM applications such as code generation, preference learning, Text-to-SQL conversion, and tool-use scenarios. The changes also include refactorings and new operators to support these pipelines, enhancing the framework's overall data preparation toolkit.

Highlights

  • New Data Processing Pipelines: Introduced four new data processing pipelines: codegen_pipeline, preference_pipeline, text2sql_pipeline, and tool_use_pipeline, designed to generate high-quality training data for various LLM tasks.
  • Codegen Operator Enhancements: Refactored CodeInstructionGenerator, ScriptSynthesizer, LogicIntegrityAuditor, and ThresholdSieve operators, including changes to output keys, removal of the model argument from ThresholdSieve, and addition of _save_data=False for efficiency.
  • New Code Feedback Formatter: Added a new CodeFeedbackFormatter operator to standardize the output format of code instructions, generated code, and feedback into a unified training data structure.
  • Preference Operator Improvements: Updated IntentExtractor and PreferencePairConstructor to improve intent extraction robustness and handle diverse instruction formats for preference data generation.
  • Text2SQL Operator Refinements: Modified SQLIntentSynthesizer to use 'question' instead of 'intent' and enforce a stricter output format, along with adjustments in other Text2SQL operators for consistency and robustness.
  • Comprehensive Test Coverage: Added dedicated test files for each new pipeline (codegen, preference, text2sql, tool-use) to ensure their functionality and integration.
Changelog
  • lazyllm/docs/data_process.py
    • Updated documentation for CodeInstructionGenerator, ScriptSynthesizer, and LogicIntegrityAuditor to reflect changes in output/input key names.
    • Revised ThresholdSieve documentation to clarify its function as a score-based filter without requiring a model.
    • Added comprehensive documentation for the new CodeFeedbackFormatter operator.
    • Included detailed documentation for the newly introduced codegen_pipelines, preference_pipelines, text2sql_pipelines, and tool_use_pipelines.
  • lazyllm/tools/data/operators/codegen_ops.py
    • Modified CodeInstructionGenerator, ScriptSynthesizer, and LogicIntegrityAuditor constructors to include _save_data=False for performance.
    • Refactored CodeInstructionGenerator to default output_key to 'instruction'.
    • Updated ThresholdSieve to remove the model argument and internal scorer, now directly using an input_score_key.
    • Added a new CodeFeedbackFormatter class to format code generation outputs into a structured training data format.
  • lazyllm/tools/data/operators/preference_ops.py
    • Updated IntentExtractor's system prompt to guide the model towards direct JSON output without verbose thinking processes and to return a single key-value pair.
    • Improved IntentExtractor.extract to handle various model response formats (dict, list) and ensure a string output.
    • Modified ResponseEvaluator's system prompt to prevent verbose model responses.
    • Enhanced PreferencePairConstructor to robustly extract instruction from different data structures (dict, list, string).
  • lazyllm/tools/data/operators/text2sql_ops.py
    • Imported JsonFormatter for potential future use or consistency.
    • Changed the default input_intent_key in SQLIntentSynthesizer to 'question' and updated its prompt to generate natural language questions with strict [QUESTION-START] and [EXTERNAL-KNOWLEDGE-START] tags.
    • Adjusted SQLIntentSynthesizer.forward to store the generated question under the 'question' key.
    • Updated input_intent_key references in SQLContextAssembler and SQLReasoningTracer to 'question'.
    • Modified SQLEffortRanker.forward to iterate model calls for num_generations to ensure correct response handling.
  • lazyllm/tools/data/pipelines/__init__.py
    • Imported new pipeline building functions: build_tool_use_pipeline, build_simple_tool_use_pipeline, build_text2sql_full_pipeline, build_preference_pipeline, and build_codegen_pipeline.
    • Exported the new pipeline building functions in the __all__ list.
  • lazyllm/tools/data/pipelines/codegen_pipelines.py
    • Added new file containing build_codegen_pipeline for full code generation data processing and build_simple_codegen_pipeline for basic instruction generation and code synthesis.
  • lazyllm/tools/data/pipelines/preference_pipelines.py
    • Added new file containing build_preference_pipeline for constructing preference training data.
  • lazyllm/tools/data/pipelines/text2sql_pipelines.py
    • Added new file containing build_text2sql_full_pipeline for generating comprehensive Text2SQL training data.
  • lazyllm/tools/data/pipelines/tool_use_pipelines.py
    • Added new file containing build_tool_use_pipeline for generating tool calling training data and build_simple_tool_use_pipeline for a simplified tool-use scenario.
  • tests/data_tests/Operators/test_codegen_ops.py
    • Updated test_code_enhancement_instruction_generator to reflect the output_key change from 'generated_instruction' to 'instruction'.
    • Modified test_code_quality_score_filter to align with the refactored ThresholdSieve (no model argument, direct score input).
    • Added test_code_feedback_formatter to validate the functionality of the new CodeFeedbackFormatter.
  • tests/data_tests/Pipeline/test_codegen_ppl.py
    • Added new test file to verify the build_codegen_pipeline functionality, including filtering and successful data generation scenarios.
  • tests/data_tests/Pipeline/test_preference_ppl.py
    • Added new test file to validate the build_preference_pipeline for preference data generation.
  • tests/data_tests/Pipeline/test_text2sql_ppl.py
    • Added new test file to test the build_text2sql_full_pipeline with mock database and model interactions.
  • tests/data_tests/Pipeline/test_tool_use_ppl.py
    • Added new test file to test both build_tool_use_pipeline and build_simple_tool_use_pipeline for tool-use data generation.
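
The ThresholdSieve change summarized above (dropping the model argument in favor of reading a score directly from an input_score_key) can be sketched as a plain score-based filter. The class name and method signature below are assumptions for illustration, not the operator's actual base class or API:

```python
# Minimal sketch of a score-based filter in the spirit of the refactored
# ThresholdSieve: no model is involved; it reads a numeric score from
# `input_score_key` and keeps records at or above the threshold.
class ScoreThresholdFilter:
    def __init__(self, input_score_key='quality_score', threshold=0.7):
        self.input_score_key = input_score_key
        self.threshold = threshold

    def __call__(self, records):
        kept = []
        for record in records:
            score = record.get(self.input_score_key)
            # Drop records with a missing or non-numeric score rather
            # than guessing a default.
            if isinstance(score, (int, float)) and score >= self.threshold:
                kept.append(record)
        return kept
```

For example, `ScoreThresholdFilter(threshold=0.5)` applied to records scored 0.9 and 0.2 keeps only the first.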
Activity
  • ChongjinHuang created this pull request to add new data processing pipelines and their corresponding tests.
  • The pull request is marked as a 'New feature' indicating significant additions to the framework's capabilities.
  • The author provided specific pytest commands for testing each new pipeline, demonstrating thoroughness in verification.

@gemini-code-assist (bot) left a comment
Code Review

This PR introduces several new data processing pipelines (codegen, preference, text2sql, tool_use), along with corresponding operators and tests, and includes refactoring of existing operators like ThresholdSieve. Documentation has also been updated. A critical high-severity SQL injection vulnerability was identified in the SQLEffortRanker operator within text2sql_ops.py, where LLM-generated and user-provided SQL queries are executed without proper validation. Furthermore, general issues include fragile data processing in preference_ops and a critical bug in tool_use_pipeline due to key and type mismatches between operators.

Comment on lines +44 to +50
ppl.dialogue_simulator = tool_use_ops.DialogueSimulator(
    model=model,
    input_composition_key='composition_task',
    input_functions_key='functions',
    output_key='conversation',
    n_turns=n_turns
)


critical

There is a critical bug in the definition of build_tool_use_pipeline. The dialogue_simulator operator defaults to an input key named composition_task, but no upstream operator in the pipeline produces that key. At this stage, the relevant available key is filtered_composition_tasks (from viability_sieve). This key-name mismatch will cause the pipeline to fail or produce incorrect results.

There also appears to be a type mismatch: viability_sieve outputs a list of tasks, while dialogue_simulator likely expects a single task string. The pipeline needs to be fixed to handle the task list, for example by iterating over it.

To fix the immediate key-name mismatch, update input_composition_key:

            input_composition_key='filtered_composition_tasks',
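
The iteration the review suggests could look like the following sketch: fan the list stored under filtered_composition_tasks out into one record per task, each carrying the single-task key the simulator expects. The function name is hypothetical and the actual lazyllm pipeline wiring is not reproduced here:

```python
# Hypothetical fan-out step: expand the task list produced by
# viability_sieve so each downstream DialogueSimulator call receives a
# single task string under 'composition_task'.
def expand_filtered_tasks(record):
    tasks = record.get('filtered_composition_tasks', [])
    # One output record per task; other keys (e.g. 'functions') are
    # carried through unchanged.
    return [{**record, 'composition_task': task} for task in tasks]
```

Applied to a record with two filtered tasks, this yields two records, each with its own composition_task value.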

@ChongjinHuang (Contributor, Author) replied:

Fixed.

# Call the model in a loop to avoid type errors from passing in a list
responses = []
for _ in range(self.num_generations):
    response = self.model(prompt)


security: high

The SQL queries generated by the LLM in this loop are subsequently parsed and executed by self.database_manager.batch_compare_queries (line 846) without any sanitization or validation. This presents a significant SQL injection risk if the LLM is manipulated via prompt injection to generate malicious SQL commands. Additionally, the ground_truth SQL query retrieved from the input data (line 820) is also executed without validation.

Recommendation: Implement strict validation and sanitization for all SQL queries before execution. Ensure that the database environment used for these comparisons is properly sandboxed and that the database user has minimal privileges (e.g., read-only access to non-sensitive tables).
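
A minimal sketch of the kind of guard the recommendation describes, assuming a SQLite-backed comparison environment: reject anything that is not a single SELECT statement, and open the database read-only so even a missed keyword cannot modify data. This is illustrative, not the operator's actual code, and the keyword allowlist is deliberately conservative:

```python
import sqlite3

# Statement keywords that should never appear in a comparison query.
FORBIDDEN = ('insert', 'update', 'delete', 'drop', 'alter', 'create',
             'attach', 'pragma', 'replace', 'vacuum')

def is_safe_select(sql: str) -> bool:
    stripped = sql.strip().rstrip(';')
    if ';' in stripped:  # reject multi-statement input
        return False
    lowered = stripped.lower()
    if not lowered.startswith('select'):
        return False
    return not any(word in lowered.split() for word in FORBIDDEN)

def run_readonly(db_path: str, sql: str):
    if not is_safe_select(sql):
        raise ValueError('query rejected by SQL guard')
    # mode=ro opens the SQLite database read-only at the connection level.
    conn = sqlite3.connect(f'file:{db_path}?mode=ro', uri=True)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()
```

A real deployment would additionally run such queries under a minimally privileged database user, as the review recommends.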

@ChongjinHuang (Contributor, Author) replied:

Fixed.

Comment on lines +131 to +150
instruction = data.get(self.instruction_key, '')
if isinstance(instruction, dict):
    instruction = list(instruction.values())[0] if instruction else ''
elif isinstance(instruction, list):
    if len(instruction) > 0:
        first_item = instruction[0]
        if isinstance(first_item, dict):
            if 'intent' in first_item:
                instruction = str(first_item['intent'])
            elif 'description' in first_item:
                instruction = str(first_item['description'])
            else:
                instruction = str(first_item)
        else:
            instruction = str(first_item)
    else:
        instruction = ''
elif not isinstance(instruction, str):
    instruction = str(instruction)


medium

The logic handling the instruction field here is overly complex and fragile. It tries to handle multiple data types (dict, list) by guessing at their structure, which can lead to unexpected behavior and is hard to maintain. Since the upstream IntentExtractor operator is supposed to return a string, the complex handling here should be simplified. It is better to enforce explicit data contracts between operators. If IntentExtractor can fail and return other types, that should ideally be handled closer to the source, or this operator should use a clearer, more robust strategy than structure-based guessing (e.g. list(instruction.values())[0]). Consider simplifying this logic to expect a string input and logging a warning or error when the type is wrong.
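
The simplified contract the review asks for might look like this sketch: expect a string, and coerce anything else with a logged warning instead of guessing at nested structures. The helper name is hypothetical:

```python
import logging

logger = logging.getLogger(__name__)

# Stricter contract: the upstream IntentExtractor should hand us a string.
# Anything else is coerced with a warning rather than structurally guessed at.
def extract_instruction(data, instruction_key='instruction'):
    value = data.get(instruction_key, '')
    if isinstance(value, str):
        return value
    logger.warning('expected str under %r, got %s; coercing',
                   instruction_key, type(value).__name__)
    return str(value) if value else ''
```

This keeps the failure visible in logs while preserving a string output for downstream operators.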

@ChongjinHuang (Contributor, Author) replied:

Fixed.

@ChongjinHuang (Contributor, Author) commented:

Pipeline rendering diagram:
[Screenshot 2026-03-06 18:36:03]

@ChongjinHuang ChongjinHuang changed the base branch from main to dev/pipeline March 9, 2026 05:26