Add text2qa and qa enhancement pipelines for SFT process. by CarlosShaoting · Pull Request #1050 · LazyAGI/LazyLLM

CarlosShaoting · 2026-03-06T09:14:54Z

📌 PR 内容 / PR Description

添加文本生成qa对和 qa对重写强化流水线

✅ 变更类型 / Type of Change

修复 Bug / Bug fix (non-breaking change that fixes an issue)
新功能 / New feature (non-breaking change that adds functionality)
重构 / Refactor (no functionality change, code structure optimized)
重大变更 / Breaking change (fix or feature that would cause existing functionality to change)
文档更新 / Documentation update (changes to docs only)
性能优化 / Performance optimization

🧪 如何测试 / How Has This Been Tested?

pytest LazyLLM/tests/data_tests/Pipeline/test_text2qa_ppl.py -v
pytest LazyLLM/tests/data_tests/Pipeline/test_enqa_ppl.py -v

📷 截图 / Demo (Optional)

Todo:

补充剩余ppl，渲染截图

gemini-code-assist · 2026-03-06T09:15:31Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the data processing capabilities for Supervised Fine-Tuning (SFT) by introducing new pipelines and operators. The changes enable more robust generation and enhancement of QA datasets from various sources, including plain text and PDFs with images. This streamlines the creation of high-quality training data for large language models, improving efficiency and flexibility in data preparation workflows.

Highlights

New SFT Data Pipelines: Introduced two new data processing pipelines for Supervised Fine-Tuning (SFT): build_text2qa_pipeline for generating QA pairs from text and build_enhance_qa_pipeline for enhancing existing QA pairs through query rewriting and diversity scoring.
Enhanced PDF QA Generation: Added new operators and a pipeline (build_pdf2qa_pipeline) to support QA generation from PDF content, including image extraction and scoring of generated QA pairs.
New Data Operators: Implemented several new data operators such as wrong_filter, wrong_answer_filter, multi_features_filter, qa_score_filter, to_alpaca_sft, and to_chat_sft to provide more granular control over data processing, filtering, and format conversion.
Refactored Core Operators: Refactored existing operators in cot_ops.py, math_ops.py, and text2qa_ops.py to improve logic, error handling, and leverage common utility functions for tasks like boxed result extraction.
Comprehensive Pipeline Exports: Updated the lazyllm/tools/data/pipelines/__init__.py to export all new and existing data pipelines, making them easily accessible.
New Test Coverage: Added new test files (test_enqa_ppl.py and test_text2qa_ppl.py) to ensure the correctness and stability of the newly introduced text2qa and qa enhancement pipelines.

Changelog

.gitignore
- Updated to ignore data pipeline results, image directories, and common image/document files.
lazyllm/docs/data_process.py
- Extended documentation for new data operators and pipelines, including Chinese and English descriptions and examples.
lazyllm/tools/data/operators/cot_ops.py
- Refactored answer_verify for direct equality checks and generalized error handling.
- Introduced a wrong_filter operator.
lazyllm/tools/data/operators/enQa_ops.py
- Adjusted whitespace in the DiversityScorer's forward method.
lazyllm/tools/data/operators/math_ops.py
- Streamlined imports and updated default model name.
- Refactored math_answer_extractor to use a common utility for boxed result extraction.
lazyllm/tools/data/operators/pdf_ops.py
- Enhanced PDF processing by modifying Pdf2Md to return structured content.
- Added a multi_features_filter operator.
- Introduced PdfChunkToQA and PdfQAScorer for advanced PDF-based QA generation and scoring.
lazyllm/tools/data/operators/pt_op.py
- Modified VQAGenerator to handle empty context and image paths more robustly.
lazyllm/tools/data/operators/text2qa_ops.py
- Refactored imports and updated default model name.
- Enhanced ChunkToQA with a default prompt and improved prompt construction.
- Refined QAScorer's logic and output, and cast score to float.
- Added qa_score_filter, to_alpaca_sft, and to_chat_sft operators.
lazyllm/tools/data/pipelines/init.py
- Expanded the __all__ export list to include newly added data pipeline modules.
lazyllm/tools/data/pipelines/cot_pipelines.py
- Added a new pipeline for Chain-of-Thought processing, including generation, extraction, and optional verification.
lazyllm/tools/data/pipelines/enhance_pipelines.py
- Added a new pipeline for enhancing QA data through query rewriting, diversity scoring, and SFT format conversion.
lazyllm/tools/data/pipelines/img_pipelines.py
- Added a new pipeline for image-to-QA generation, scoring, and filtering.
lazyllm/tools/data/pipelines/math_pipelines.py
- Added a new pipeline for mathematical Chain-of-Thought processing, incorporating generation, extraction, verification, and quality evaluation.
lazyllm/tools/data/pipelines/pdf_pipelines.py
- Added a new pipeline for PDF-to-QA generation, including PDF parsing, chunking, QA generation, and scoring.
lazyllm/tools/data/pipelines/text_pipelines.py
- Added a new pipeline for text-to-QA generation, encompassing text chunking, filtering, QA generation, scoring, and SFT format conversion.
tests/data_tests/Pipeline/test_enqa_ppl.py
- Added new unit tests for the enhanced QA pipeline.
tests/data_tests/Pipeline/test_text2qa_ppl.py
- Added new unit tests for the text-to-QA pipeline.

Activity

CarlosShaoting created the pull request to add text2qa and qa enhancement pipelines for the SFT process.
CarlosShaoting provided two screenshots demonstrating the functionality of the new pipelines.
CarlosShaoting listed two pytest commands for testing the new pipelines: pytest LazyLLM/tests/data_tests/Pipeline/test_text2qa_ppl.py -v and pytest LazyLLM/tests/data_tests/Pipeline/test_enqa_ppl.py -v.
CarlosShaoting indicated that remaining pipelines and rendering screenshots are still to be supplemented.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces new data processing pipelines for generating and enhancing question-answer pairs for supervised fine-tuning (SFT), including text2qa and qa enhancement pipelines with new operators and tests. A critical path traversal vulnerability was identified in the PdfChunkToQA operator, and multiple prompt injection vulnerabilities exist across several operators due to untrusted user input being directly concatenated into LLM prompts. Additionally, the review highlights areas for improving robustness by handling potential errors, fixing inconsistencies in documentation and implementation, and removing duplicated code, specifically addressing potential KeyError and TypeError exceptions and incorrect logic for optional parameters.

gemini-code-assist · 2026-03-06T09:20:41Z

lazyllm/tools/data/operators/math_ops.py

+@data_register('data.Text2qa', rewrite_func='forward')
+def wrong_answer_filter(data, input_key, min_score):
+    score = data.get(input_key, 0)
+    if score >= min_score:
+        return None
+    return []


The function wrong_answer_filter is a duplicate of qa_score_filter, which is already defined in lazyllm/tools/data/operators/text2qa_ops.py. Additionally, it's being registered under the data.Text2qa group from within math_ops.py, which is confusing. To maintain code clarity and avoid duplication, this function should be removed from this file. Any logic that needs it should import and use Text2qa.qa_score_filter.

gemini-code-assist · 2026-03-06T09:20:41Z

lazyllm/tools/data/operators/pdf_ops.py

+            out = self.model(encode_query_with_filepaths(query, local_paths))
+            data[self.query_key] = out.get(self.query_key, '')
+            data[self.answer_key] = out.get(self.answer_key, '')
+            data[self.image_key] = local_paths[0]
+            return data
+        user_prompt = self.user_prompt or '根据下面文本生成一个 QA 对：\n'
+        inp = f'{user_prompt}\n{chunk}'
+        qa = self.model(inp)


This section is vulnerable to prompt injection as untrusted user input from the chunk field is directly concatenated into the LLM prompt, allowing attackers to manipulate the LLM's behavior. Additionally, there's an inconsistency where only the first image path (local_paths[0]) is saved to data[self.image_key], but all extracted image paths are passed to encode_query_with_filepaths. Consider storing all relevant image paths in the data item for consistency.

Suggested change

out = self.model(encode_query_with_filepaths(query, local_paths))

data[self.query_key] = out.get(self.query_key, '')

data[self.answer_key] = out.get(self.answer_key, '')

data[self.image_key] = local_paths[0]

return data

user_prompt = self.user_prompt or '根据下面文本生成一个 QA 对：\n'

inp = f'{user_prompt}\n{chunk}'

qa = self.model(inp)

data[self.image_key] = local_paths

gemini-code-assist · 2026-03-06T09:20:42Z

lazyllm/tools/data/operators/pdf_ops.py

+        image_rel_paths = self._extract_images(chunk)
+        if image_rel_paths:
+            base_dir = os.path.join('lazyllm', 'tools', 'data', 'operators', 'imgs')
+            os.makedirs(base_dir, exist_ok=True)
+            local_paths = []
+            for rel_path in image_rel_paths:
+                filename = os.path.basename(rel_path)
+                local_path = os.path.join(base_dir, filename)
+                local_paths.append(local_path)
+                src_path = os.path.join(self.mineru_api, rel_path) if self.mineru_api else rel_path
+                if src_path.startswith('http'):
+                    import requests
+                    r = requests.get(src_path)
+                    with open(local_path, 'wb') as f:
+                        f.write(r.content)
+                else:
+                    import shutil
+                    shutil.copy(src_path, local_path)


The PdfChunkToQA.forward method extracts image paths from the chunk input using a regular expression and then uses these paths in shutil.copy without proper validation. An attacker can provide a specially crafted chunk containing a path like ![alt](images/../../../../etc/passwd) to read arbitrary files from the server. These files are then potentially exfiltrated through the LLM response.

gemini-code-assist · 2026-03-06T09:20:42Z

lazyllm/tools/data/operators/pdf_ops.py

+        user_prompt = self.user_prompt or f'''
+请根据下面内容和图片(可以没有图片)对 QA 打分：
+
+原文：
+{chunk}
+
+图片路径：
+{img_path}
+
+{qa_payload}
+
+规则：
+- 严格基于原文和图片 → 1
+- 否则 → 0
+'''
+        res = self.model(encode_query_with_filepaths(user_prompt, [img_path]))


The PdfQAScorer operator concatenates untrusted input from chunk, query, and answer fields into the prompt sent to the LLM, making it vulnerable to prompt injection.

gemini-code-assist · 2026-03-06T09:20:42Z

lazyllm/tools/data/operators/pdf_ops.py

+        if not (chunk and query and answer and img_path):
+            data[self.output_key] = 0
+            return data


The condition if not (chunk and query and answer and img_path): incorrectly makes img_path a required field. However, the prompt indicates that an image is optional ("可以没有图片"). This check will cause samples without images to be incorrectly assigned a score of 0. The check should only validate the required fields.

Suggested change

if not (chunk and query and answer and img_path):

data[self.output_key] = 0

return data

if not (chunk and query and answer):

data[self.output_key] = 0

return data

gemini-code-assist · 2026-03-06T09:20:42Z

lazyllm/docs/data_process.py

+筛选样本的算子。
+
+- 如果输入字段为 True，则表示样本正确，保留原始数据用于后续处理。  
+- 如果输入字段为 False，则表示样本错误，返回 None，被过滤掉。


This line in the documentation is incorrect. When the input field is False, the function returns [] to filter out the sample, not None. Also, returning None means the sample is kept, not filtered out, according to the project's data processing rules.

Suggested change

- 如果输入字段为 False，则表示样本错误，返回 None，被过滤掉。

- 如果输入字段为 False，则表示样本错误，返回 []，被过滤掉。

References

In LazyLLM's data processing pipeline, a filter function returning None indicates that the data item should be kept, while returning an empty list ([]) indicates it should be discarded.

gemini-code-assist · 2026-03-06T09:20:42Z

lazyllm/tools/data/operators/pdf_ops.py

+        except Exception:
+            pass


Silently passing on Exception can hide issues during data processing. It's better to log a warning when a value cannot be converted to a float, which would help with debugging and ensure sufficient visibility and context for key operations, as per project logging rules.

Suggested change

except Exception:

pass

except Exception:

LOG.warning(f"Could not convert value to float in multi_features_filter for item: {x}")

References

For key operations, favor detailed INFO level logs over concise DEBUG logs to ensure sufficient visibility and context.

gemini-code-assist · 2026-03-06T09:20:42Z

lazyllm/docs/data_process.py

+Sample filtering operator.
+
+- If the specified field is True, the sample is considered correct and the original data is retained for further processing.  
+- If the specified field is False, the sample is considered wrong and returns None (filtered out).


This line in the documentation is incorrect. When the input field is False, the function returns [] to filter out the sample, not None. Also, returning None means the sample is kept, not filtered out, according to the project's data processing rules.

Suggested change

- If the specified field is False, the sample is considered wrong and returns None (filtered out).

- If the specified field is False, the sample is considered wrong and returns [] (filtered out).

References

In LazyLLM's data processing pipeline, a filter function returning None indicates that the data item should be kept, while returning an empty list ([]) indicates it should be discarded.

gemini-code-assist · 2026-03-06T09:20:42Z

lazyllm/tools/data/operators/cot_ops.py

+    except Exception:
        data[output_key] = False


Removing the error logging here reduces visibility into potential issues during answer verification. Swallowing the exception without logging can make debugging difficult. According to the project's general rules, key operations should have detailed logs to ensure sufficient visibility and context. Please consider re-adding the logging, which will also require re-importing LOG at the top of the file.

Suggested change

except Exception:

data[output_key] = False

except Exception as e:

from lazyllm import LOG

LOG.error(f'Error verifying answers: {e}')

data[output_key] = False

References

For key operations, favor detailed INFO level logs over concise DEBUG logs to ensure sufficient visibility and context.

gemini-code-assist · 2026-03-06T09:20:42Z

lazyllm/docs/data_process.py

+print(op(data1)  # None, kept
+print(op(data2)  # [], filtered out


There appears to be a syntax error in this example code. The print statements are missing their closing parentheses.

Suggested change

print(op(data1) # None, kept

print(op(data2) # [], filtered out

print(op(data1)) # None, kept

print(op(data2)) # [], filtered out

cuishaoting added 7 commits February 28, 2026 11:24

reabse_main

f58d3f3

finish_ppls

8271d84

ppl

f915a6a

add_op_docs

4c03adb

done_text2qadocs

b1caf77

add_text2qa_enhanceqa_ppls

ae4043f

add_docs

83e2bbd

CarlosShaoting requested review from a team as code owners March 6, 2026 09:14

fix_format

3b1b37a

gemini-code-assist bot reviewed Mar 6, 2026

View reviewed changes

CarlosShaoting had a problem deploying to protected March 6, 2026 09:22 — with GitHub Actions Error

done_two_ppls

ad673e0

CarlosShaoting requested a deployment to protected March 6, 2026 09:53 — with GitHub Actions Waiting

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add text2qa and qa enhancement pipelines for SFT process.#1050

Add text2qa and qa enhancement pipelines for SFT process.#1050
CarlosShaoting wants to merge 9 commits intoLazyAGI:mainfrom
CarlosShaoting:cst/all_ppls

CarlosShaoting commented Mar 6, 2026 •

edited

Loading

Uh oh!

gemini-code-assist bot commented Mar 6, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Mar 6, 2026

Uh oh!

gemini-code-assist bot Mar 6, 2026

Uh oh!

gemini-code-assist bot Mar 6, 2026

Uh oh!

gemini-code-assist bot Mar 6, 2026

Uh oh!

gemini-code-assist bot Mar 6, 2026

Uh oh!

gemini-code-assist bot Mar 6, 2026

Uh oh!

gemini-code-assist bot Mar 6, 2026

Uh oh!

gemini-code-assist bot Mar 6, 2026

Uh oh!

gemini-code-assist bot Mar 6, 2026

Uh oh!

gemini-code-assist bot Mar 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	- 如果输入字段为 False，则表示样本错误，返回 None，被过滤掉。
	- 如果输入字段为 False，则表示样本错误，返回 []，被过滤掉。

	- If the specified field is False, the sample is considered wrong and returns None (filtered out).
	- If the specified field is False, the sample is considered wrong and returns [] (filtered out).

		print(op(data1) # None, kept
		print(op(data2) # [], filtered out

Conversation

CarlosShaoting commented Mar 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📌 PR 内容 / PR Description

✅ 变更类型 / Type of Change

🧪 如何测试 / How Has This Been Tested?

📷 截图 / Demo (Optional)

Todo:

Uh oh!

gemini-code-assist bot commented Mar 6, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

CarlosShaoting commented Mar 6, 2026 •

edited

Loading