Add codegen_pipeline, preference_pipeline, text2sql_pipeline, tool_use_pipeline #1049
ChongjinHuang wants to merge 9 commits into LazyAGI:dev/pipeline from
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly expands the data processing capabilities of the
Code Review
This PR introduces several new data processing pipelines (codegen, preference, text2sql, tool_use) with corresponding operators and tests, refactors existing operators such as ThresholdSieve, and updates the documentation. A high-severity SQL injection vulnerability was identified in the SQLEffortRanker operator in text2sql_ops.py, where LLM-generated and user-provided SQL queries are executed without validation. Other issues include fragile data processing in preference_ops and a critical bug in tool_use_pipeline caused by key and type mismatches between operators.
ppl.dialogue_simulator = tool_use_ops.DialogueSimulator(
    model=model,
    input_composition_key='composition_task',
    input_functions_key='functions',
    output_key='conversation',
    n_turns=n_turns
)
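There is a critical bug in the definition of build_tool_use_pipeline. The dialogue_simulator operator is initialized to expect an input key named composition_task, but no earlier operator in the pipeline produces that key. At this stage, the relevant available key is filtered_composition_tasks (from viability_sieve). This key mismatch will cause the pipeline to fail or produce incorrect results.
There also appears to be a type mismatch. viability_sieve outputs a list of tasks, whereas dialogue_simulator probably expects a single task string. The pipeline needs to be fixed to handle the list of tasks, for example by iterating over them.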
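One way to iterate over the task list is to fan each record out into one record per task before the simulator runs. The following is only a minimal sketch under stated assumptions: `fan_out_tasks` is a hypothetical helper that is not part of this PR, records are assumed to be plain dicts, and the idea that `dialogue_simulator` wants a single string under `composition_task` is taken from the review comment rather than confirmed against the operator's code.

```python
from typing import Any, Dict, List


def fan_out_tasks(
    record: Dict[str, Any],
    list_key: str = 'filtered_composition_tasks',
    item_key: str = 'composition_task',
) -> List[Dict[str, Any]]:
    """Hypothetical helper: split one record holding a task list into one record per task."""
    tasks = record.get(list_key) or []
    # Tolerate an upstream operator that already emitted a single string.
    if isinstance(tasks, str):
        tasks = [tasks]
    # Copy every other field (e.g. 'functions') into each per-task record.
    rest = {k: v for k, v in record.items() if k != list_key}
    return [{**rest, item_key: str(task)} for task in tasks]


record = {
    'functions': ['get_weather'],
    'filtered_composition_tasks': ['task A', 'task B'],
}
rows = fan_out_tasks(record)
```

Each element of `rows` then carries a single `composition_task` string alongside the original `functions` value, which would match the key the simulator is configured with. A record with no surviving tasks yields an empty list, so nothing is passed downstream for it.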
To fix the immediate key mismatch, you should update input_composition_key.
input_composition_key='filtered_composition_tasks',

# Call the model in a loop to avoid the type error caused by passing in a list
responses = []
for _ in range(self.num_generations):
    response = self.model(prompt)
The SQL queries generated by the LLM in this loop are subsequently parsed and executed by self.database_manager.batch_compare_queries (line 846) without any sanitization or validation. This presents a significant SQL injection risk if the LLM is manipulated via prompt injection to generate malicious SQL commands. Additionally, the ground_truth SQL query retrieved from the input data (line 820) is also executed without validation.
Recommendation: Implement strict validation and sanitization for all SQL queries before execution. Ensure that the database environment used for these comparisons is properly sandboxed and that the database user has minimal privileges (e.g., read-only access to non-sensitive tables).
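The validation step in this recommendation could look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: `is_read_only_select` and `run_guarded` are hypothetical names, and the standard-library `sqlite3` module is used only to keep the demo self-contained, since the page does not say which database engine `database_manager` talks to. The keyword filter is deliberately conservative and will also reject some harmless queries, for example ones that use a `REPLACE()` function or contain a forbidden word inside a string literal.

```python
import re
import sqlite3

# Keywords that indicate a write or schema change; matching is case-insensitive.
_FORBIDDEN = re.compile(
    r'\b(insert|update|delete|drop|alter|create|replace|attach|detach|pragma|vacuum)\b',
    re.IGNORECASE,
)


def is_read_only_select(sql: str) -> bool:
    """Accept only a single SELECT/WITH statement with no write keywords."""
    stripped = sql.strip().rstrip(';').strip()
    # Reject empty input and anything with a second statement after a ';'.
    if not stripped or ';' in stripped:
        return False
    if not re.match(r'(?is)^(select|with)\b', stripped):
        return False
    return _FORBIDDEN.search(stripped) is None


def run_guarded(conn: sqlite3.Connection, sql: str):
    """Execute sql only if it passes the read-only check (hypothetical wrapper)."""
    if not is_read_only_select(sql):
        raise ValueError('rejected non-read-only SQL')
    return conn.execute(sql).fetchall()


# Small in-memory demo database.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (id INTEGER)')
conn.execute('INSERT INTO t VALUES (1)')
rows = run_guarded(conn, 'SELECT id FROM t')
```

A text filter like this is only one layer. The sandboxing and minimal-privilege part of the recommendation still applies; with SQLite, for instance, the database file can be opened read-only via `sqlite3.connect('file:path/to.db?mode=ro', uri=True)`, so that a query slipping past the filter still cannot modify data.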
instruction = data.get(self.instruction_key, '')
if isinstance(instruction, dict):
    instruction = list(instruction.values())[0] if instruction else ''
elif isinstance(instruction, list):
    if len(instruction) > 0:
        first_item = instruction[0]
        if isinstance(first_item, dict):
            if 'intent' in first_item:
                instruction = str(first_item['intent'])
            elif 'description' in first_item:
                instruction = str(first_item['description'])
            else:
                instruction = str(first_item)
        else:
            instruction = str(first_item)
    else:
        instruction = ''
elif not isinstance(instruction, str):
    instruction = str(instruction)
📌 PR Description
✅ Type of Change
🧪 How Has This Been Tested?