Skip to content

Add text2qa and qa enhancement pipelines for SFT process.#1050

Open
CarlosShaoting wants to merge 9 commits intoLazyAGI:mainfrom
CarlosShaoting:cst/all_ppls
Open

Add text2qa and qa enhancement pipelines for SFT process.#1050
CarlosShaoting wants to merge 9 commits intoLazyAGI:mainfrom
CarlosShaoting:cst/all_ppls

Conversation

@CarlosShaoting
Copy link
Contributor

@CarlosShaoting CarlosShaoting commented Mar 6, 2026

📌 PR 内容 / PR Description

  • 添加 文本生成qa对 和 qa对重写强化 流水线

✅ 变更类型 / Type of Change

  • 修复 Bug / Bug fix (non-breaking change that fixes an issue)
  • 新功能 / New feature (non-breaking change that adds functionality)
  • 重构 / Refactor (no functionality change, code structure optimized)
  • 重大变更 / Breaking change (fix or feature that would cause existing functionality to change)
  • 文档更新 / Documentation update (changes to docs only)
  • 性能优化 / Performance optimization

🧪 如何测试 / How Has This Been Tested?

  1. pytest LazyLLM/tests/data_tests/Pipeline/test_text2qa_ppl.py -v
  2. pytest LazyLLM/tests/data_tests/Pipeline/test_enqa_ppl.py -v

📷 截图 / Demo (Optional)

image image image

Todo:

补充 剩余ppl,渲染截图

@CarlosShaoting CarlosShaoting requested review from a team as code owners March 6, 2026 09:14
@gemini-code-assist
Copy link

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the data processing capabilities for Supervised Fine-Tuning (SFT) by introducing new pipelines and operators. The changes enable more robust generation and enhancement of QA datasets from various sources, including plain text and PDFs with images. This streamlines the creation of high-quality training data for large language models, improving efficiency and flexibility in data preparation workflows.

Highlights

  • New SFT Data Pipelines: Introduced two new data processing pipelines for Supervised Fine-Tuning (SFT): build_text2qa_pipeline for generating QA pairs from text and build_enhance_qa_pipeline for enhancing existing QA pairs through query rewriting and diversity scoring.
  • Enhanced PDF QA Generation: Added new operators and a pipeline (build_pdf2qa_pipeline) to support QA generation from PDF content, including image extraction and scoring of generated QA pairs.
  • New Data Operators: Implemented several new data operators such as wrong_filter, wrong_answer_filter, multi_features_filter, qa_score_filter, to_alpaca_sft, and to_chat_sft to provide more granular control over data processing, filtering, and format conversion.
  • Refactored Core Operators: Refactored existing operators in cot_ops.py, math_ops.py, and text2qa_ops.py to improve logic, error handling, and leverage common utility functions for tasks like boxed result extraction.
  • Comprehensive Pipeline Exports: Updated the lazyllm/tools/data/pipelines/__init__.py to export all new and existing data pipelines, making them easily accessible.
  • New Test Coverage: Added new test files (test_enqa_ppl.py and test_text2qa_ppl.py) to ensure the correctness and stability of the newly introduced text2qa and qa enhancement pipelines.
Changelog
  • .gitignore
    • Updated to ignore data pipeline results, image directories, and common image/document files.
  • lazyllm/docs/data_process.py
    • Extended documentation for new data operators and pipelines, including Chinese and English descriptions and examples.
  • lazyllm/tools/data/operators/cot_ops.py
    • Refactored answer_verify for direct equality checks and generalized error handling.
    • Introduced a wrong_filter operator.
  • lazyllm/tools/data/operators/enQa_ops.py
    • Adjusted whitespace in the DiversityScorer's forward method.
  • lazyllm/tools/data/operators/math_ops.py
    • Streamlined imports and updated default model name.
    • Refactored math_answer_extractor to use a common utility for boxed result extraction.
  • lazyllm/tools/data/operators/pdf_ops.py
    • Enhanced PDF processing by modifying Pdf2Md to return structured content.
    • Added a multi_features_filter operator.
    • Introduced PdfChunkToQA and PdfQAScorer for advanced PDF-based QA generation and scoring.
  • lazyllm/tools/data/operators/pt_op.py
    • Modified VQAGenerator to handle empty context and image paths more robustly.
  • lazyllm/tools/data/operators/text2qa_ops.py
    • Refactored imports and updated default model name.
    • Enhanced ChunkToQA with a default prompt and improved prompt construction.
    • Refined QAScorer's logic and output, and cast score to float.
    • Added qa_score_filter, to_alpaca_sft, and to_chat_sft operators.
  • lazyllm/tools/data/pipelines/init.py
    • Expanded the __all__ export list to include newly added data pipeline modules.
  • lazyllm/tools/data/pipelines/cot_pipelines.py
    • Added a new pipeline for Chain-of-Thought processing, including generation, extraction, and optional verification.
  • lazyllm/tools/data/pipelines/enhance_pipelines.py
    • Added a new pipeline for enhancing QA data through query rewriting, diversity scoring, and SFT format conversion.
  • lazyllm/tools/data/pipelines/img_pipelines.py
    • Added a new pipeline for image-to-QA generation, scoring, and filtering.
  • lazyllm/tools/data/pipelines/math_pipelines.py
    • Added a new pipeline for mathematical Chain-of-Thought processing, incorporating generation, extraction, verification, and quality evaluation.
  • lazyllm/tools/data/pipelines/pdf_pipelines.py
    • Added a new pipeline for PDF-to-QA generation, including PDF parsing, chunking, QA generation, and scoring.
  • lazyllm/tools/data/pipelines/text_pipelines.py
    • Added a new pipeline for text-to-QA generation, encompassing text chunking, filtering, QA generation, scoring, and SFT format conversion.
  • tests/data_tests/Pipeline/test_enqa_ppl.py
    • Added new unit tests for the enhanced QA pipeline.
  • tests/data_tests/Pipeline/test_text2qa_ppl.py
    • Added new unit tests for the text-to-QA pipeline.
Activity
  • CarlosShaoting created the pull request to add text2qa and qa enhancement pipelines for the SFT process.
  • CarlosShaoting provided two screenshots demonstrating the functionality of the new pipelines.
  • CarlosShaoting listed two pytest commands for testing the new pipelines: pytest LazyLLM/tests/data_tests/Pipeline/test_text2qa_ppl.py -v and pytest LazyLLM/tests/data_tests/Pipeline/test_enqa_ppl.py -v.
  • CarlosShaoting indicated that remaining pipelines and rendering screenshots are still to be supplemented.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces new data processing pipelines for generating and enhancing question-answer pairs for supervised fine-tuning (SFT), including text2qa and qa enhancement pipelines with new operators and tests. A critical path traversal vulnerability was identified in the PdfChunkToQA operator, and multiple prompt injection vulnerabilities exist across several operators due to untrusted user input being directly concatenated into LLM prompts. Additionally, the review highlights areas for improving robustness by handling potential errors, fixing inconsistencies in documentation and implementation, and removing duplicated code, specifically addressing potential KeyError and TypeError exceptions and incorrect logic for optional parameters.

Comment on lines +429 to +434
@data_register('data.Text2qa', rewrite_func='forward')
def wrong_answer_filter(data, input_key, min_score):
score = data.get(input_key, 0)
if score >= min_score:
return None
return []

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The function wrong_answer_filter is a duplicate of qa_score_filter, which is already defined in lazyllm/tools/data/operators/text2qa_ops.py. Additionally, it's being registered under the data.Text2qa group from within math_ops.py, which is confusing. To maintain code clarity and avoid duplication, this function should be removed from this file. Any logic that needs it should import and use Text2qa.qa_score_filter.

Comment on lines +130 to +137
out = self.model(encode_query_with_filepaths(query, local_paths))
data[self.query_key] = out.get(self.query_key, '')
data[self.answer_key] = out.get(self.answer_key, '')
data[self.image_key] = local_paths[0]
return data
user_prompt = self.user_prompt or '根据下面文本生成一个 QA 对:\n'
inp = f'{user_prompt}\n{chunk}'
qa = self.model(inp)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

This section is vulnerable to prompt injection as untrusted user input from the chunk field is directly concatenated into the LLM prompt, allowing attackers to manipulate the LLM's behavior. Additionally, there's an inconsistency where only the first image path (local_paths[0]) is saved to data[self.image_key], but all extracted image paths are passed to encode_query_with_filepaths. Consider storing all relevant image paths in the data item for consistency.

Suggested change
out = self.model(encode_query_with_filepaths(query, local_paths))
data[self.query_key] = out.get(self.query_key, '')
data[self.answer_key] = out.get(self.answer_key, '')
data[self.image_key] = local_paths[0]
return data
user_prompt = self.user_prompt or '根据下面文本生成一个 QA 对:\n'
inp = f'{user_prompt}\n{chunk}'
qa = self.model(inp)
data[self.image_key] = local_paths

Comment on lines +109 to +126
image_rel_paths = self._extract_images(chunk)
if image_rel_paths:
base_dir = os.path.join('lazyllm', 'tools', 'data', 'operators', 'imgs')
os.makedirs(base_dir, exist_ok=True)
local_paths = []
for rel_path in image_rel_paths:
filename = os.path.basename(rel_path)
local_path = os.path.join(base_dir, filename)
local_paths.append(local_path)
src_path = os.path.join(self.mineru_api, rel_path) if self.mineru_api else rel_path
if src_path.startswith('http'):
import requests
r = requests.get(src_path)
with open(local_path, 'wb') as f:
f.write(r.content)
else:
import shutil
shutil.copy(src_path, local_path)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The PdfChunkToQA.forward method extracts image paths from the chunk input using a regular expression and then uses these paths in shutil.copy without proper validation. An attacker can provide a specially crafted chunk containing a path like ![alt](images/../../../../etc/passwd) to read arbitrary files from the server. These files are then potentially exfiltrated through the LLM response.

Comment on lines +186 to +201
user_prompt = self.user_prompt or f'''
请根据下面内容和图片(可以没有图片)对 QA 打分:

原文:
{chunk}

图片路径:
{img_path}

{qa_payload}

规则:
- 严格基于原文和图片 → 1
- 否则 → 0
'''
res = self.model(encode_query_with_filepaths(user_prompt, [img_path]))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The PdfQAScorer operator concatenates untrusted input from chunk, query, and answer fields into the prompt sent to the LLM, making it vulnerable to prompt injection.

Comment on lines +182 to +184
if not (chunk and query and answer and img_path):
data[self.output_key] = 0
return data

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The condition if not (chunk and query and answer and img_path): incorrectly makes img_path a required field. However, the prompt indicates that an image is optional ("可以没有图片"). This check will cause samples without images to be incorrectly assigned a score of 0. The check should only validate the required fields.

Suggested change
if not (chunk and query and answer and img_path):
data[self.output_key] = 0
return data
if not (chunk and query and answer):
data[self.output_key] = 0
return data

筛选样本的算子。

- 如果输入字段为 True,则表示样本正确,保留原始数据用于后续处理。
- 如果输入字段为 False,则表示样本错误,返回 None,被过滤掉。

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This line in the documentation is incorrect. When the input field is False, the function returns [] to filter out the sample, not None. Also, returning None means the sample is kept, not filtered out, according to the project's data processing rules.

Suggested change
- 如果输入字段为 False则表示样本错误返回 None被过滤掉
- 如果输入字段为 False则表示样本错误返回 []被过滤掉
References
  1. In LazyLLM's data processing pipeline, a filter function returning None indicates that the data item should be kept, while returning an empty list ([]) indicates it should be discarded.

Comment on lines +56 to +57
except Exception:
pass

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Silently passing on Exception can hide issues during data processing. It's better to log a warning when a value cannot be converted to a float, which would help with debugging and ensure sufficient visibility and context for key operations, as per project logging rules.

Suggested change
except Exception:
pass
except Exception:
LOG.warning(f"Could not convert value to float in multi_features_filter for item: {x}")
References
  1. For key operations, favor detailed INFO level logs over concise DEBUG logs to ensure sufficient visibility and context.

Sample filtering operator.

- If the specified field is True, the sample is considered correct and the original data is retained for further processing.
- If the specified field is False, the sample is considered wrong and returns None (filtered out).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This line in the documentation is incorrect. When the input field is False, the function returns [] to filter out the sample, not None. Also, returning None means the sample is kept, not filtered out, according to the project's data processing rules.

Suggested change
- If the specified field is False, the sample is considered wrong and returns None (filtered out).
- If the specified field is False, the sample is considered wrong and returns [] (filtered out).
References
  1. In LazyLLM's data processing pipeline, a filter function returning None indicates that the data item should be kept, while returning an empty list ([]) indicates it should be discarded.

Comment on lines +152 to 153
except Exception:
data[output_key] = False

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Removing the error logging here reduces visibility into potential issues during answer verification. Swallowing the exception without logging can make debugging difficult. According to the project's general rules, key operations should have detailed logs to ensure sufficient visibility and context. Please consider re-adding the logging, which will also require re-importing LOG at the top of the file.

Suggested change
except Exception:
data[output_key] = False
except Exception as e:
from lazyllm import LOG
LOG.error(f'Error verifying answers: {e}')
data[output_key] = False
References
  1. For key operations, favor detailed INFO level logs over concise DEBUG logs to ensure sufficient visibility and context.

Comment on lines +7060 to +7061
print(op(data1) # None, kept
print(op(data2) # [], filtered out

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There appears to be a syntax error in this example code. The print statements are missing their closing parentheses.

Suggested change
print(op(data1) # None, kept
print(op(data2) # [], filtered out
print(op(data1)) # None, kept
print(op(data2)) # [], filtered out

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant