test: update unit and integration tests for converters, chunkers, and…#69

Merged: namtroi merged 1 commit into main from pdf/optimize on Dec 29, 2025
@namtroi commented on Dec 29, 2025

Owner

User description

… profiles


PR Type

Tests


Description

  • Updated profile model tests to use flexible assertions instead of hardcoded values

    • Allows schema defaults to be tuned without breaking tests
  • Added comprehensive tests for PDF post-processing methods

    • Tests for page artifact removal and code block cleanup
  • Added header_levels parameter tests for DocumentChunker

    • Validates clamping behavior and breadcrumb tracking
  • Created new PyMuPDFConverter test suite for link stripping

  • Updated TabularChunker test to verify large table handling behavior


Diagram Walkthrough

flowchart LR
  A["Profile Model Tests"] -->|"Flexible assertions"| B["Schema-agnostic validation"]
  C["Converter Tests"] -->|"Post-processing methods"| D["PDF artifact removal"]
  E["PyMuPDF Tests"] -->|"Link stripping"| F["Markdown link handling"]
  G["DocumentChunker Tests"] -->|"Header levels"| H["Breadcrumb tracking"]
  I["TabularChunker Tests"] -->|"Large table behavior"| J["Single chunk verification"]

File Walkthrough

Relevant files
Tests
profile-model.test.ts
Convert profile defaults to flexible assertions                   

apps/backend/tests/unit/models/profile-model.test.ts

  • Replaced hardcoded default value assertions with flexible range-based
    checks
  • Added comments explaining behavior-driven testing approach
  • Validates reasonable defaults exist without enforcing specific values
  • Allows schema defaults to be tuned independently of tests
+32/-25 
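
The range-based assertion style described for this file can be illustrated as follows. This is a minimal sketch written in Python rather than the TypeScript used by profile-model.test.ts; `ProfileDefaults`, its field names, and its values are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass
class ProfileDefaults:
    """Hypothetical stand-in for the profile schema defaults."""
    chunk_size: int = 1000
    chunk_overlap: int = 100


def check_defaults_are_reasonable(profile: ProfileDefaults) -> bool:
    """Assert properties of the defaults rather than their exact values."""
    # Brittle style (breaks whenever a default is tuned):
    #   assert profile.chunk_size == 1000
    # Flexible style: require only a sane range and a consistent relationship.
    assert isinstance(profile.chunk_size, int)
    assert profile.chunk_size > 0
    assert 0 <= profile.chunk_overlap < profile.chunk_size
    return True
```

With checks of this shape, changing a default from 1000 to 2048 would not fail the test, while a nonsensical default (zero size, overlap larger than the chunk) still would.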
test_base_converter.py
Add PDF post-processing method tests                                         

apps/ai-worker/tests/test_base_converter.py

  • Added TestPostProcessPdf class with 3 test methods for PDF
    post-processing
  • Added TestPostProcessPymupdf class with 4 test methods for
    PyMuPDF-specific processing
  • Tests verify removal of page artifacts, empty code blocks, and soft
    linebreak merging
  • Tests validate full processing chain including normalization
+65/-0   
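
As a rough illustration of the kind of post-processing these tests exercise, here is a minimal sketch of page artifact removal and empty code block removal. The function names and the "Page N of M" pattern are assumptions made for illustration; the converter's actual rules are not shown in this PR, and soft linebreak merging is left out.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker

# Hypothetical artifact pattern: matches lines such as "Page 3" or "Page 3 of 10".
_PAGE_ARTIFACT = re.compile(r"^\s*Page\s+\d+(?:\s+of\s+\d+)?\s*$", re.IGNORECASE)


def remove_page_artifacts(text):
    """Drop lines that consist solely of a page marker."""
    kept = [line for line in text.split("\n") if not _PAGE_ARTIFACT.match(line)]
    return "\n".join(kept)


def remove_empty_code_blocks(text):
    """Drop fenced code blocks whose body is empty or whitespace only."""
    out = []
    block = None  # lines of the fenced block currently being read, if any
    for line in text.split("\n"):
        is_fence = line.strip().startswith(FENCE)
        if block is None:
            if is_fence:
                block = [line]  # opening fence
            else:
                out.append(line)
        else:
            block.append(line)
            if is_fence:  # closing fence
                if any(body_line.strip() for body_line in block[1:-1]):
                    out.extend(block)
                block = None
    if block is not None:
        out.extend(block)  # unterminated fence: keep the lines untouched
    return "\n".join(out)


def post_process(text):
    """Chain the two cleanup steps, page artifacts first."""
    return remove_empty_code_blocks(remove_page_artifacts(text))
```

A line-based scan is used for the code blocks instead of a single regular expression so that a closing fence followed directly by the next opening fence is not mistaken for an empty block.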
test_document_chunker.py
Add header_levels parameter validation tests                         

apps/ai-worker/tests/test_document_chunker.py

  • Added TestHeaderLevels class with 7 comprehensive test methods
  • Tests validate default header_levels value of 3
  • Tests verify clamping behavior for values below 1 and above 6
  • Tests confirm breadcrumb tracking for different header level
    configurations
+59/-0   
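
The clamping and breadcrumb behavior these tests describe could look roughly like the sketch below. Only the default of 3 and the 1 to 6 bounds come from the description above; the function names, the " > " separator, and the choice to treat deeper headings as body text are assumptions, not DocumentChunker's actual implementation.

```python
import re

DEFAULT_HEADER_LEVELS = 3    # default stated in the PR description
MIN_LEVEL, MAX_LEVEL = 1, 6  # Markdown heading levels run from 1 to 6

_HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def clamp_header_levels(value=DEFAULT_HEADER_LEVELS):
    """Force the requested depth into the supported 1..6 range."""
    return max(MIN_LEVEL, min(MAX_LEVEL, value))


def breadcrumbs(markdown, header_levels=DEFAULT_HEADER_LEVELS):
    """Return (breadcrumb, line) pairs for every non-blank body line."""
    depth = clamp_header_levels(header_levels)
    trail = {}   # heading level -> title currently in effect
    result = []
    for line in markdown.splitlines():
        match = _HEADING.match(line)
        if match and len(match.group(1)) <= depth:
            level = len(match.group(1))
            # A new heading invalidates every deeper entry in the trail.
            trail = {k: v for k, v in trail.items() if k < level}
            trail[level] = match.group(2)
            continue
        # Headings deeper than `depth` fall through and count as body text.
        if line.strip():
            crumb = " > ".join(trail[k] for k in sorted(trail))
            result.append((crumb, line))
    return result
```

For example, with `header_levels=2` a level-3 heading does not extend the breadcrumb and is kept as an ordinary line under the enclosing level-2 section.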
test_pymupdf_converter.py
Create PyMuPDFConverter link stripping tests                         

apps/ai-worker/tests/test_pymupdf_converter.py

  • Created new test file with TestStripHiddenLinks class
  • Added 10 test methods covering markdown link stripping functionality
  • Tests cover basic links, nested brackets, multiple links, edge cases
  • Documents current behavior including URL parentheses edge case
+84/-0   
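
A minimal sketch of Markdown link stripping, to make the cases listed above concrete. `strip_markdown_links` and its regular expression are hypothetical; the real converter method and its exact behavior are not shown in this PR. The sketch also shows one plausible form of the URL-parentheses edge case, where a simple pattern stops at the first closing parenthesis.

```python
import re

# Matches [text](url) where text contains no "]" and url contains no ")".
_MD_LINK = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")


def strip_markdown_links(text):
    """Replace each [text](url) with just its visible text."""
    return _MD_LINK.sub(r"\1", text)


# Basic and multiple links are reduced to their link text.
basic = strip_markdown_links("See [docs](https://example.com) now")
multiple = strip_markdown_links("[a](x) and [b](y)")

# Edge case: a URL containing parentheses is cut at the first ")",
# so a stray ")" is left behind by this simple pattern.
parens = strip_markdown_links("[Wiki](https://en.wikipedia.org/wiki/Foo_(bar))")

# Nested brackets are not matched by this pattern and pass through unchanged.
nested = strip_markdown_links("[[1]](u)")
```

Pinning edge cases like the last two in tests, even when the output is imperfect, makes any later change to the pattern show up as an explicit test diff.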
test_tabular_chunker.py
Implement large table chunking behavior test                         

apps/ai-worker/tests/test_tabular_chunker.py

  • Replaced placeholder test with actual implementation
  • Tests verify large Markdown tables remain as single chunk
  • Generates 100-row table and validates chunk count and metadata
  • Documents current behavior of keeping Markdown tables intact
+13/-5   

@qodo-code-review


PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified: no security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
Codebase Duplication Compliance
Codebase context is not defined

Custom Compliance
🟢
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status: Passed

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Passed

Compliance status legend
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@namtroi merged commit 6db3baf into main on Dec 29, 2025
7 checks passed
@qodo-code-review


PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General
Ensure entire table content is preserved

In test_large_markdown_table_stays_single_chunk, strengthen the assertion to
verify that the chunk's content is identical to the original table input, not
just that it contains a specific string.

apps/ai-worker/tests/test_tabular_chunker.py [82-94]

 def test_large_markdown_table_stays_single_chunk(self, chunker):
     """Large Markdown tables remain as single chunk (current behavior)."""
     # Generate a large table with 100 rows
     header = "| Name | Age | City |\n|---|---|---|\n"
     rows = "| Alice | 30 | NYC |\n" * 100
     large_table = header + rows
 
     chunks = chunker.chunk(large_table)
 
     # Current behavior: Markdown tables are kept as single chunk
     assert len(chunks) == 1
     assert chunks[0]["metadata"]["chunk_type"] == "tabular"
-    assert "Alice" in chunks[0]["content"]
+    assert chunks[0]["content"] == large_table
Suggestion importance[1-10]: 7

Why: The suggestion correctly points out a weakness in the test and proposes a stricter assertion to ensure the entire table content is preserved, which improves test robustness.
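
To see why the stricter assertion matters, consider a sketch with a deliberately buggy stand-in. `TruncatingChunker` is hypothetical and exists only to show that a substring check can pass while rows are silently lost; it does not reflect how TabularChunker behaves.

```python
class TruncatingChunker:
    """Hypothetical buggy chunker that silently drops all but the first rows."""

    def chunk(self, text):
        lines = text.splitlines(keepends=True)
        truncated = "".join(lines[:12])  # header, separator, and only 10 rows
        return [{"content": truncated, "metadata": {"chunk_type": "tabular"}}]


def build_large_table(rows=100):
    """Build the same kind of 100-row table used in the test above."""
    header = "| Name | Age | City |\n|---|---|---|\n"
    return header + "| Alice | 30 | NYC |\n" * rows


def weak_check(chunks, table):
    """Original assertion style: only looks for a substring."""
    return len(chunks) == 1 and "Alice" in chunks[0]["content"]


def strict_check(chunks, table):
    """Suggested assertion style: the whole table must be preserved."""
    return len(chunks) == 1 and chunks[0]["content"] == table


table = build_large_table()
chunks = TruncatingChunker().chunk(table)
weak_result = weak_check(chunks, table)      # passes despite 90 missing rows
strict_result = strict_check(chunks, table)  # fails, exposing the data loss
```

Exact equality assumes the chunker returns the input unmodified, including the trailing newline. If the real chunker normalizes whitespace, comparing stripped or otherwise normalized forms would be the safer variant.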

Impact: Medium
