Add text2qa and qa enhancement pipelines for SFT process.#1050
Add text2qa and qa enhancement pipelines for SFT process.#1050CarlosShaoting wants to merge 9 commits intoLazyAGI:mainfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands the data processing capabilities for Supervised Fine-Tuning (SFT) by introducing new pipelines and operators. The changes enable more robust generation and enhancement of QA datasets from various sources, including plain text and PDFs with images. This streamlines the creation of high-quality training data for large language models, improving efficiency and flexibility in data preparation workflows. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces new data processing pipelines for generating and enhancing question-answer pairs for supervised fine-tuning (SFT), including text2qa and qa enhancement pipelines with new operators and tests. A critical path traversal vulnerability was identified in the PdfChunkToQA operator, and multiple prompt injection vulnerabilities exist across several operators due to untrusted user input being directly concatenated into LLM prompts. Additionally, the review highlights areas for improving robustness by handling potential errors, fixing inconsistencies in documentation and implementation, and removing duplicated code, specifically addressing potential KeyError and TypeError exceptions and incorrect logic for optional parameters.
| @data_register('data.Text2qa', rewrite_func='forward') | ||
| def wrong_answer_filter(data, input_key, min_score): | ||
| score = data.get(input_key, 0) | ||
| if score >= min_score: | ||
| return None | ||
| return [] |
There was a problem hiding this comment.
The function wrong_answer_filter is a duplicate of qa_score_filter, which is already defined in lazyllm/tools/data/operators/text2qa_ops.py. Additionally, it's being registered under the data.Text2qa group from within math_ops.py, which is confusing. To maintain code clarity and avoid duplication, this function should be removed from this file. Any logic that needs it should import and use Text2qa.qa_score_filter.
| out = self.model(encode_query_with_filepaths(query, local_paths)) | ||
| data[self.query_key] = out.get(self.query_key, '') | ||
| data[self.answer_key] = out.get(self.answer_key, '') | ||
| data[self.image_key] = local_paths[0] | ||
| return data | ||
| user_prompt = self.user_prompt or '根据下面文本生成一个 QA 对:\n' | ||
| inp = f'{user_prompt}\n{chunk}' | ||
| qa = self.model(inp) |
There was a problem hiding this comment.
This section is vulnerable to prompt injection as untrusted user input from the chunk field is directly concatenated into the LLM prompt, allowing attackers to manipulate the LLM's behavior. Additionally, there's an inconsistency where only the first image path (local_paths[0]) is saved to data[self.image_key], but all extracted image paths are passed to encode_query_with_filepaths. Consider storing all relevant image paths in the data item for consistency.
| out = self.model(encode_query_with_filepaths(query, local_paths)) | |
| data[self.query_key] = out.get(self.query_key, '') | |
| data[self.answer_key] = out.get(self.answer_key, '') | |
| data[self.image_key] = local_paths[0] | |
| return data | |
| user_prompt = self.user_prompt or '根据下面文本生成一个 QA 对:\n' | |
| inp = f'{user_prompt}\n{chunk}' | |
| qa = self.model(inp) | |
| data[self.image_key] = local_paths |
| image_rel_paths = self._extract_images(chunk) | ||
| if image_rel_paths: | ||
| base_dir = os.path.join('lazyllm', 'tools', 'data', 'operators', 'imgs') | ||
| os.makedirs(base_dir, exist_ok=True) | ||
| local_paths = [] | ||
| for rel_path in image_rel_paths: | ||
| filename = os.path.basename(rel_path) | ||
| local_path = os.path.join(base_dir, filename) | ||
| local_paths.append(local_path) | ||
| src_path = os.path.join(self.mineru_api, rel_path) if self.mineru_api else rel_path | ||
| if src_path.startswith('http'): | ||
| import requests | ||
| r = requests.get(src_path) | ||
| with open(local_path, 'wb') as f: | ||
| f.write(r.content) | ||
| else: | ||
| import shutil | ||
| shutil.copy(src_path, local_path) |
There was a problem hiding this comment.
The PdfChunkToQA.forward method extracts image paths from the chunk input using a regular expression and then uses these paths in shutil.copy without proper validation. An attacker can provide a specially crafted chunk containing a path like  to read arbitrary files from the server. These files are then potentially exfiltrated through the LLM response.
| user_prompt = self.user_prompt or f''' | ||
| 请根据下面内容和图片(可以没有图片)对 QA 打分: | ||
|
|
||
| 原文: | ||
| {chunk} | ||
|
|
||
| 图片路径: | ||
| {img_path} | ||
|
|
||
| {qa_payload} | ||
|
|
||
| 规则: | ||
| - 严格基于原文和图片 → 1 | ||
| - 否则 → 0 | ||
| ''' | ||
| res = self.model(encode_query_with_filepaths(user_prompt, [img_path])) |
| if not (chunk and query and answer and img_path): | ||
| data[self.output_key] = 0 | ||
| return data |
There was a problem hiding this comment.
The condition if not (chunk and query and answer and img_path): incorrectly makes img_path a required field. However, the prompt indicates that an image is optional ("可以没有图片"). This check will cause samples without images to be incorrectly assigned a score of 0. The check should only validate the required fields.
| if not (chunk and query and answer and img_path): | |
| data[self.output_key] = 0 | |
| return data | |
| if not (chunk and query and answer): | |
| data[self.output_key] = 0 | |
| return data |
| 筛选样本的算子。 | ||
|
|
||
| - 如果输入字段为 True,则表示样本正确,保留原始数据用于后续处理。 | ||
| - 如果输入字段为 False,则表示样本错误,返回 None,被过滤掉。 |
There was a problem hiding this comment.
This line in the documentation is incorrect. When the input field is False, the function returns [] to filter out the sample, not None. Also, returning None means the sample is kept, not filtered out, according to the project's data processing rules.
| - 如果输入字段为 False,则表示样本错误,返回 None,被过滤掉。 | |
| - 如果输入字段为 False,则表示样本错误,返回 [],被过滤掉。 |
References
- In LazyLLM's data processing pipeline, a filter function returning
Noneindicates that the data item should be kept, while returning an empty list ([]) indicates it should be discarded.
| except Exception: | ||
| pass |
There was a problem hiding this comment.
Silently passing on Exception can hide issues during data processing. It's better to log a warning when a value cannot be converted to a float, which would help with debugging and ensure sufficient visibility and context for key operations, as per project logging rules.
| except Exception: | |
| pass | |
| except Exception: | |
| LOG.warning(f"Could not convert value to float in multi_features_filter for item: {x}") |
References
- For key operations, favor detailed
INFOlevel logs over conciseDEBUGlogs to ensure sufficient visibility and context.
| Sample filtering operator. | ||
|
|
||
| - If the specified field is True, the sample is considered correct and the original data is retained for further processing. | ||
| - If the specified field is False, the sample is considered wrong and returns None (filtered out). |
There was a problem hiding this comment.
This line in the documentation is incorrect. When the input field is False, the function returns [] to filter out the sample, not None. Also, returning None means the sample is kept, not filtered out, according to the project's data processing rules.
| - If the specified field is False, the sample is considered wrong and returns None (filtered out). | |
| - If the specified field is False, the sample is considered wrong and returns [] (filtered out). |
References
- In LazyLLM's data processing pipeline, a filter function returning
Noneindicates that the data item should be kept, while returning an empty list ([]) indicates it should be discarded.
| except Exception: | ||
| data[output_key] = False |
There was a problem hiding this comment.
Removing the error logging here reduces visibility into potential issues during answer verification. Swallowing the exception without logging can make debugging difficult. According to the project's general rules, key operations should have detailed logs to ensure sufficient visibility and context. Please consider re-adding the logging, which will also require re-importing LOG at the top of the file.
| except Exception: | |
| data[output_key] = False | |
| except Exception as e: | |
| from lazyllm import LOG | |
| LOG.error(f'Error verifying answers: {e}') | |
| data[output_key] = False |
References
- For key operations, favor detailed
INFOlevel logs over conciseDEBUGlogs to ensure sufficient visibility and context.
| print(op(data1) # None, kept | ||
| print(op(data2) # [], filtered out |
There was a problem hiding this comment.
📌 PR 内容 / PR Description
✅ 变更类型 / Type of Change
🧪 如何测试 / How Has This Been Tested?
📷 截图 / Demo (Optional)
Todo:
补充 剩余ppl,渲染截图