Rust implementation#392
Conversation
* Parsing and Partial Validation done * commented changes are executed * Add link to demo in readme * corrected the error collection * draft validator * updates .gitignore * updates .gitignore * updated .gitignore * updated validation * target folder removed --------- Co-authored-by: kavitharaju <kavitharaju18@gmail.com> Co-authored-by: Joel Mathew <mail@joelmath.io>
* On branch rust new file: rust-usfm-parser/Cargo.toml new file: rust-usfm-parser/src/main.rs * rust project init * rust project folder init * rust floder init * added README.md * read me * readme * readme * rust init * rust init * rust init * node to json * Add link to demo in readme * usjGenerator * all markers added * all markers added. * more type support added * more type support added * Rename main.rs to usjGenerator.rs * Delete rust-usfm-parser/src/usj_generator.rs * Delete rust-usfm-parser/src/usj_generator.js * Delete rust-usfm-parser/src/main * usj_generator added modified: rust-usfm-parser/Cargo.lock modified: rust-usfm-parser/Cargo.toml renamed: rust-usfm-parser/src/input.usfm -> rust-usfm-parser/input.usfm modified: rust-usfm-parser/src/main.rs new file: rust-usfm-parser/src/usj_generator.rs * new file: rust-usfm-parser/src/globals.rs modified: rust-usfm-parser/src/main.rs modified: rust-usfm-parser/src/parser.rs modified: rust-usfm-parser/src/usj_generator.rs * Delete rust-usfm-parser/target directory * Update usjGenerator.rs * Delete .idea directory * Delete rust-usfm-parser/target directory * Delete .idea directory * Update package.json * Update usj_generator.py * Delete rust-usfm-parser/src/usjGenerator.rs * Update usj_generator.rs * Update validator.rs * Update schema.rs * Update parser.rs * Update main.rs * Update globals.rs * Delete rust-usfm-parser/captures.rs * Delete rust-usfm-parser/captures.rs * fixed some bugs modified: rust-usfm-parser/input.usfm modified: rust-usfm-parser/src/usj_generator.rs * Delete node-usfm-parser/package-lock.json * Update input.usfm --------- Co-authored-by: kavitharaju <kavitharaju18@gmail.com> Co-authored-by: Joel Mathew <mail@joelmath.io>
* Add link to demo in readme * lib added new file: rust-usfm-parser/src/lib.rs * lib edit modified: rust-usfm-parser/src/lib.rs * test cases added ,filter addded to parser.rs * Test suit verified * added seperate files for test utils and test_usj functions * added seperate files for test_utils and test_usj functions --------- Co-authored-by: kavitharaju <kavitharaju18@gmail.com> Co-authored-by: Joel Mathew <mail@joelmath.io>
* On branch rust new file: rust-usfm-parser/Cargo.toml new file: rust-usfm-parser/src/main.rs * rust project init * rust project folder init * rust floder init * added README.md * read me * readme * readme * rust init * rust init * rust init * node to json * Add link to demo in readme * usjGenerator * all markers added * all markers added. * more type support added * more type support added * Rename main.rs to usjGenerator.rs * Delete rust-usfm-parser/src/usj_generator.rs * Delete rust-usfm-parser/src/usj_generator.js * Delete rust-usfm-parser/src/main * usj_generator added modified: rust-usfm-parser/Cargo.lock modified: rust-usfm-parser/Cargo.toml renamed: rust-usfm-parser/src/input.usfm -> rust-usfm-parser/input.usfm modified: rust-usfm-parser/src/main.rs new file: rust-usfm-parser/src/usj_generator.rs * new file: rust-usfm-parser/src/globals.rs modified: rust-usfm-parser/src/main.rs modified: rust-usfm-parser/src/parser.rs modified: rust-usfm-parser/src/usj_generator.rs * Delete rust-usfm-parser/target directory * Update usjGenerator.rs * Delete .idea directory * Delete rust-usfm-parser/target directory * Delete .idea directory * Update package.json * Update usj_generator.py * Delete rust-usfm-parser/src/usjGenerator.rs * Update usj_generator.rs * Update validator.rs * Update schema.rs * Update parser.rs * Update main.rs * Update globals.rs * Delete rust-usfm-parser/captures.rs * Delete rust-usfm-parser/captures.rs * fixed some bugs modified: rust-usfm-parser/input.usfm modified: rust-usfm-parser/src/usj_generator.rs * Delete node-usfm-parser/package-lock.json * Update input.usfm * lib added new file: rust-usfm-parser/src/lib.rs * lib edit modified: rust-usfm-parser/src/lib.rs * test cases added ,filter addded to parser.rs * Test suit verified * added seperate files for test utils and test_usj functions * added seperate files for test_utils and test_usj functions * format of the errors have updated for test functions * format of the errors have updated for test functions --------- Co-authored-by: amal-J-k <22l024@gecskp.ac.in> Co-authored-by: amal-J-k <143809111+amal-J-k@users.noreply.github.com> Co-authored-by: kavitharaju <kavitharaju18@gmail.com> Co-authored-by: Joel Mathew <mail@joelmath.io>
* latest usj_generator modified: rust-usfm-parser/src/main.rs modified: rust-usfm-parser/src/parser.rs modified: rust-usfm-parser/src/usj_generator.rs modified: rust-usfm-parser/tests/test_usj.rs * cahnged node2_usj modified: rust-usfm-parser/src/main.rs modified: rust-usfm-parser/src/usj_generator.rs * usj_generator changed modified: rust-usfm-parser/src/usj_generator.rs * small changes in node_2_usj modified: rust-usfm-parser/src/usj_generator.rs * debug usj_generator.rs modified: rust-usfm-parser/src/usj_generator.rs * major bug that caused repetition of fixed modified: rust-usfm-parser/src/usj_generator.rs
* latest usj_generator modified: rust-usfm-parser/src/main.rs modified: rust-usfm-parser/src/parser.rs modified: rust-usfm-parser/src/usj_generator.rs modified: rust-usfm-parser/tests/test_usj.rs * cahnged node2_usj modified: rust-usfm-parser/src/main.rs modified: rust-usfm-parser/src/usj_generator.rs * usj_generator changed modified: rust-usfm-parser/src/usj_generator.rs * small changes in node_2_usj modified: rust-usfm-parser/src/usj_generator.rs * debug usj_generator.rs modified: rust-usfm-parser/src/usj_generator.rs * major bug that caused repetition of fixed modified: rust-usfm-parser/src/usj_generator.rs * Update usj_generator.rs
* threading * implimented threading in the new code * updated gitignore * implimented threading in the new code
* Update grammar to support char in rem * Bump version: 3.0.0 → 3.0.1-alpha.1 * Add new test case to check char in rem support
* Fix issue of missing contents when paras are excluded * Always include BCV when converting to List * Copy changes in node to web also * Test and fix * Bump version: 3.0.1-alpha.1 → 3.0.1-alpha.2 * Include link to issue in \rem test case
* Implement biblenlp to usfm conversion * Add a test case for BibleNlp to USFM conversion * Include BibleNLP's default vref in code base * Implement CLI for in_format biblenlp * Add usage instruction for from_biblenlp option in README * Fix some linting issues * Fix more linting issues * Copy biblenlp to usfm conversion logic to node * Add tests in node * Minor test fix * Copy the conversion logic to web module * Tests for web module * Documentation for BibleNlp input usage in web and node modules * Bump version: 3.0.1-alpha.2 → 3.0.1-alpha.3 * Correct Readmes * Correct Readmes * Add links to inner readme files from the outer readme * Allow optional bookcode arg when using full vref and/or texts * Allow optional bookcode arg when using full vref in Node * Allow optional bookcode arg when using full vref in Web module * Test and fix node and web * Allow text list to be of length 23213, 31170 or 41899 * Handle the changes in python CLI * Ignore pylint warning * Ignore pylint warning * Ignore pylint warning
Bumps [tar-fs](https://github.com/mafintosh/tar-fs) from 2.1.1 to 2.1.2. - [Commits](mafintosh/tar-fs@v2.1.1...v2.1.2) --- updated-dependencies: - dependency-name: tar-fs dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Fix the USJ conversion when there is nestedchar in footnote. In python * Add a test case with char in footnote with and without nesting * Fix the USX conversion in python * Fix the issue in node and web * Update grammar to allow empty \ft and other notes elements * update the usj output in testsuite
* Output version in cli * Bump version: 3.0.1-alpha.3 → 3.0.1-alpha.4 * Validate the usfm starts with \ before parsing * Exempt empty file tests * Change to_list() to generate reference list and change how include exclude works * Implement toList() changes in node * Implement toList() changes in web * Update error message in test for usfm sanity check * Bump version: 3.0.1-alpha.4 → 3.1.0 * Handle linting issues
Upgrades the versions of github actions due to the deprecation. - upload-artifact@v3 to upload-artifacts@v4 - download-artifacts@v3 to download-artifacts@v4
* fix the special handling of c and v not reachable * List xmldom as a peerdependency * Bump version: 3.1.0 → 3.1.1
* Update grammar to allow v and empty string in table cell * Bump version: 3.1.1 → 3.1.2 * Add test for empty table cell and v in table * In USX gen, to add verse end, find the previous rows last cell when in for cell of current row * save state of last versetext parent and sid in python * Implement the fix in node and web * Minor change
Bumps [tar-fs](https://github.com/mafintosh/tar-fs) from 2.1.2 to 2.1.3. - [Commits](https://github.com/mafintosh/tar-fs/commits) --- updated-dependencies: - dependency-name: tar-fs dependency-version: 2.1.3 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Add grammar rules for \tcc * Update python methods to include \tcc * Add testcase for \tcc * Update web module to support \tcc * Update node module to support \tcc * Fix linting issues * Fix typo
* Update testpypi_publish.yml * Update testpypi_publish.yml * Update pypi_publish.yml * Update pypi_publish.yml * Update pypi_publish.yml * Update pypi_publish.yml * Update pypi_publish.yml
* Allow multiple \sr and \r markers under one \s# * Bump version: 3.1.2 → 3.1.3
Bumps [deepdiff](https://github.com/seperman/deepdiff) from 7.0.1 to 8.6.1. - [Release notes](https://github.com/seperman/deepdiff/releases) - [Changelog](https://github.com/seperman/deepdiff/blob/master/docs/changelog.rst) - [Commits](https://github.com/seperman/deepdiff/commits) --- updated-dependencies: - dependency-name: deepdiff dependency-version: 8.6.1 dependency-type: direct:development ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [jupyterlab](https://github.com/jupyterlab/jupyterlab) from 3.6.8 to 4.4.8. - [Release notes](https://github.com/jupyterlab/jupyterlab/releases) - [Changelog](https://github.com/jupyterlab/jupyterlab/blob/main/RELEASE.md) - [Commits](https://github.com/jupyterlab/jupyterlab/compare/@jupyterlab/vdom@3.6.8...@jupyterlab/lsp@4.4.8) --- updated-dependencies: - dependency-name: jupyterlab dependency-version: 4.4.8 dependency-type: direct:development ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [tar-fs](https://github.com/mafintosh/tar-fs) from 2.1.3 to 2.1.4. - [Commits](mafintosh/tar-fs@v2.1.3...v2.1.4) --- updated-dependencies: - dependency-name: tar-fs dependency-version: 2.1.4 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>
… out generated ones in comparison tests
|
CodeAnt AI is reviewing your PR. Thanks for using CodeAnt! 🎉We're free for open-source projects. if you're enjoying it, help us grow by sharing. Share on X · |
|
CodeAnt AI is running the review. Thanks for using CodeAnt! 🎉We're free for open-source projects. if you're enjoying it, help us grow by sharing. Share on X · |
Sequence DiagramThis diagram shows how the new Rust USFMParser accepts multiple input formats, builds a syntax tree with tree sitter, and converts the content into USJ, USX, table, USFM, or BibleNLP formats with optional marker filtering. sequenceDiagram
participant Client
participant USFMParser
participant TreeSitter
participant Converters
Client->>USFMParser: Create from USFM USJ USX or BibleNLP data
USFMParser->>TreeSitter: Parse USFM string into syntax tree and collect errors
TreeSitter-->>USFMParser: Syntax tree plus errors warnings
Client->>USFMParser: Request output in chosen format
USFMParser->>USFMParser: Apply marker filters and combine adjacent text
USFMParser->>Converters: Convert filtered USJ to USJ USX table USFM or BibleNLP
Converters-->>USFMParser: Target format data
USFMParser-->>Client: Return converted output or write files
Generated by CodeAnt AI |
|
CodeAnt AI finished reviewing your PR. |
| let (vrefs, texts): (Vec<&str>, Vec<&str>) = if let Some(ref bc) = book_code_upper { | ||
| let pairs: Vec<(&str, &str)> = input | ||
| .vref | ||
| .iter() | ||
| .zip(input.text.iter()) | ||
| .filter(|(r, _)| r.trim().to_uppercase().starts_with(bc.as_str())) | ||
| .map(|(r, t)| (r.as_str(), t.as_str())) | ||
| .collect(); | ||
| pairs.iter().map(|(r, t)| (*r, *t)).unzip() |
There was a problem hiding this comment.
🟠 Architect Review — HIGH
In biblenlp_to_usfm, when a book_code is provided, vref and text are zipped and filtered before the length check, so mismatched input lists can silently drop extra verses instead of raising the intended length/versification error.
Suggestion: Mirror the Python logic by validating the original vref/text lengths before zipping, and when filtering by book_code, filter both lists in lockstep and then enforce a strict length equality check that raises a ParsingError on mismatch instead of truncating.
Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is an **Architect / Logical Review** comment left during a code review. These reviews are first-class, important findings — not optional suggestions. Do NOT dismiss this as a 'big architectural change' just because the title says architect review; most of these can be resolved with a small, localized fix once the intent is understood.
**Path:** rust-usfm-parser/src/usfm_generator.rs
**Line:** 429:437
**Comment:**
*HIGH: In biblenlp_to_usfm, when a book_code is provided, vref and text are zipped and filtered before the length check, so mismatched input lists can silently drop extra verses instead of raising the intended length/versification error.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
If a suggested approach is provided above, use it as the authoritative instruction. If no explicit code suggestion is given, you MUST still draft and apply your own minimal, localized fix — do not punt back with 'no suggestion provided, review manually'. Keep the change as small as possible: add a guard clause, gate on a loading state, reorder an await, wrap in a conditional, etc. Do not refactor surrounding code or expand scope beyond the finding.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| let align = if style.contains("tcc") || style.contains("thc") { | ||
| "center" | ||
| } else if style.ends_with('r') { | ||
| "end" | ||
| } else { | ||
| "start" |
There was a problem hiding this comment.
🟠 Architect Review — HIGH
In USXGenerator::node_2_usx_table, table cell alignment uses style.ends_with('r'), so numbered right-aligned markers like "tcr3" and "thr3" are misclassified as "start" rather than "end", diverging from the USJGenerator alignment logic and the tests' expectations.
Suggestion: Normalize the alignment rule for table cells to match USJGenerator (e.g. detect right alignment with style.contains('r') while keeping the special tcc/thc center check) so numbered tcr*/thr* styles are correctly marked as end-aligned in USX output.
Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is an **Architect / Logical Review** comment left during a code review. These reviews are first-class, important findings — not optional suggestions. Do NOT dismiss this as a 'big architectural change' just because the title says architect review; most of these can be resolved with a small, localized fix once the intent is understood.
**Path:** rust-usfm-parser/src/usx_generator.rs
**Line:** 615:620
**Comment:**
*HIGH: In USXGenerator::node_2_usx_table, table cell alignment uses style.ends_with('r'), so numbered right-aligned markers like "tcr3" and "thr3" are misclassified as "start" rather than "end", diverging from the USJGenerator alignment logic and the tests' expectations.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
If a suggested approach is provided above, use it as the authoritative instruction. If no explicit code suggestion is given, you MUST still draft and apply your own minimal, localized fix — do not punt back with 'no suggestion provided, review manually'. Keep the change as small as possible: add a guard clause, gate on a loading state, reorder an await, wrap in a conditional, etc. Do not refactor surrounding code or expand scope beyond the finding.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| fn punct_no_space_after() -> &'static Regex { | ||
| static RE: OnceLock<Regex> = OnceLock::new(); | ||
| RE.get_or_init(|| Regex::new(r#"[,.\-—/;:!?@$%^)}\]>»]$"#).unwrap()) | ||
| } |
There was a problem hiding this comment.
Suggestion: The "no-space-after" punctuation regex is inverted: it currently matches trailing closing punctuation like . and ,, which suppresses spaces after sentence punctuation and can merge words incorrectly during text combination. Use the opening-punctuation set for this check (as in the Python behavior) so spaces are only suppressed where appropriate. [logic error]
Severity Level: Critical 🚨
- ❌ BibleNLP text output merges words after sentence punctuation.
- ❌ USJ output with combine_texts=true mis-spaces verse text.
- ⚠️ Table/list exports with combine_texts=true may join tokens.Steps of Reproduction ✅
1. Build and run the Rust CLI in `rust-usfm-parser/src/main.rs` with a real USFM file that
contains sentence-ending punctuation, e.g. `tests/usfmjsTests/psa_quotes/origin.usfm`
(verified in that file at `tests/usfmjsTests/psa_quotes/origin.usfm:13-16` with text
ending in `,` and `.`).
2. Invoke BibleNLP export so that the Rust code paths are exercised by running:
`usfm-grammar --infile tests/usfmjsTests/psa_quotes/origin.usfm --out_format biblenlp
--ignore_errors` which hits the `match out_format` arm `"biblenlp"` in
`rust-usfm-parser/src/main.rs:27-48` and calls `parser.to_biblenlp_format(ignore_errors)`
at `main.rs:41`.
3. Inside `USFMParser::to_biblenlp_format` in
`rust-usfm-parser/src/usfm_parser.rs:380-401`, the parser builds a USJ tree and then calls
`include_markers_in_usj(&usj_root, &keep, true)` at `usfm_parser.rs:17-19`, passing
`combine_texts = true`, which in turn causes `include_inner` in
`rust-usfm-parser/src/filter.rs:77-146` to call `combine_consecutive_text_contents` at
`filter.rs:128-129`.
4. In `combine_consecutive_text_contents` (`rust-usfm-parser/src/filter.rs:37-67`), the
decision to insert a space between accumulated `buffer` and incoming `text` uses
`!punct_no_space_before().is_match(&text) && !punct_no_space_after().is_match(&buffer)` at
`filter.rs:47-48`. However, `punct_no_space_after` is currently defined as
`Regex::new(r#"[,.\-—/;:!?@$%^)}\]>»]$"#)` at `filter.rs:16-18`, which matches *closing*
punctuation like `.` and `,`. As a result, when a verse fragment ends with `.` (e.g. the
`mockers.` at `psa_quotes/origin.usfm:15`) and the next fragment starts with a word,
`punct_no_space_after().is_match(&buffer)` returns true and `add_space` becomes false,
producing concatenated output like `mockers.Blessed` in the generated BibleNLP text
instead of `mockers. Blessed`. Comparing to the reference JavaScript implementation in
`web-usfm-parser/src/filters.js:12-16` shows that `punctPatternNoSpaceAfter` there
intentionally matches *opening* punctuation (`/[\-—/`@^&({[<"«]$/`), so the Rust version's
use of closing punctuation is inverted and causes incorrect space suppression after
sentence punctuation.Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** rust-usfm-parser/src/filter.rs
**Line:** 16:19
**Comment:**
*Logic Error: The "no-space-after" punctuation regex is inverted: it currently matches trailing closing punctuation like `.` and `,`, which suppresses spaces after sentence punctuation and can merge words incorrectly during text combination. Use the opening-punctuation set for this check (as in the Python behavior) so spaces are only suppressed where appropriate.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| } else if is_biblenlp { | ||
| Err("BibleNLP input is not yet supported in this CLI. \ | ||
| Supply a USFM, USJ, or USX file instead.".into()) |
There was a problem hiding this comment.
Suggestion: --in_format biblenlp is advertised as a valid input format, but this branch always returns an error instead of constructing a parser from BibleNLP data. This breaks the CLI contract and makes BibleNLP input unusable. Implement BibleNLP parsing here (including reading vref data) and call USFMParser::from_biblenlp(...) instead of hard-failing. [logic error]
Severity Level: Critical 🚨
- ❌ CLI cannot import BibleNLP data via --in_format biblenlp.
- ⚠️ Rust CLI diverges from BibleNLP support promised in PR.
- ⚠️ Prevents round-tripping BibleNLP → USFM in Rust tool.Steps of Reproduction ✅
1. Build the Rust CLI binary for this crate (e.g. `cargo build -p rust-usfm-parser`) so
that `rust-usfm-parser/src/main.rs` is compiled into an executable using `main()` at lines
130–404.
2. Run the CLI with an explicit BibleNLP input format, for example:
`target/debug/rust-usfm-parser --in_format biblenlp some_input.txt` (the `in_format` flag
is accepted because it is registered with `value_parser(["usfm", "usj", "usx",
"biblenlp"])` in `main` at lines 153–157).
3. In `main` at lines 243–249, the program calls `handle_input(infile,
Some(in_format.unwrap_or("usfm")))`, passing `"biblenlp"` as `in_format`; inside
`handle_input` (lines 92–124), `fmt_str` is computed and `is_biblenlp` is set to `true` at
line 107 when `fmt_str.as_deref() == Some("biblenlp")`.
4. The `else if is_biblenlp` branch at lines 117–119 is taken and returns `Err("BibleNLP
input is not yet supported in this CLI. Supply a USFM, USJ, or USX file
instead.".into())`, which `main` treats as a fatal error at lines 243–249
(`eprintln!("Error: {}", e); process::exit(1);`), so any `--in_format biblenlp` invocation
fails even though the flag is advertised as valid input and
`USFMParser::from_biblenlp(...)` exists in `rust-usfm-parser/src/usfm_parser.rs:20–28`.Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** rust-usfm-parser/src/main.rs
**Line:** 117:119
**Comment:**
*Logic Error: `--in_format biblenlp` is advertised as a valid input format, but this branch always returns an error instead of constructing a parser from BibleNLP data. This breaks the CLI contract and makes BibleNLP input unusable. Implement BibleNLP parsing here (including reading vref data) and call `USFMParser::from_biblenlp(...)` instead of hard-failing.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| for row in rows { | ||
| let line = [ | ||
| row.book.as_str(), | ||
| row.chapter.as_str(), | ||
| row.verse.as_str(), | ||
| row.text.as_str(), | ||
| row.r#type.as_str(), | ||
| row.marker.as_str(), | ||
| ] | ||
| .join(col_sep); | ||
| write!(out, "{}{}", line, row_sep).unwrap(); |
There was a problem hiding this comment.
Suggestion: Table output is generated by raw string joining, so any field containing the chosen separator or row separator will corrupt the table structure (shifted columns or broken rows). Use a proper CSV/TSV writer with escaping/quoting instead of manual join. [logic error]
Severity Level: Major ⚠️
- ⚠️ Table/CSV exports can be malformed when text has separators.
- ⚠️ Downstream spreadsheet imports may misalign verse columns.
- ⚠️ Rust CLI behavior diverges from Python csv.writer implementation.Steps of Reproduction ✅
1. Build the Rust CLI binary for this crate (e.g. `cargo build -p rust-usfm-parser`) so
`main()` in `rust-usfm-parser/src/main.rs:130–404` is used.
2. Prepare an input that leads to a `ListRow` whose `text` includes the chosen column or
row separator; for example, create a small USFM file under `tests-data/` where a verse
text contains a literal tab character (`\t`) or newline, which `USFMParser::to_list(...)`
ultimately copies into `ListRow.text` without stripping (see
`rust-usfm-parser/src/list_generator.rs:141–169` where `text.to_string()` is assigned
directly).
3. Run the CLI with table output: `target/debug/rust-usfm-parser --out_format table
--csv_col_sep "\t" --csv_row_sep "\n" path/to/that.usfm`. In `main` at lines 297–307, the
program calls `parser.to_list(...)` and iterates `rows` at lines 308–320.
4. For each row, `main.rs` lines 311–319 build `line` by `[..].join(col_sep)` and then
write it with `write!(out, "{}{}", line, row_sep)` at line 320; when `row.text` contains
`col_sep` or `row_sep`, the output row gains extra separators or embedded newlines, so
downstream CSV/TSV consumers (e.g. Python's `csv.reader` or spreadsheet tools) see shifted
columns or split rows instead of one logical record—unlike the Python CLI, which uses
`csv.writer` with proper quoting in `py-usfm-parser/src/usfm_grammar/__main__.py:18–30`.Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** rust-usfm-parser/src/main.rs
**Line:** 310:320
**Comment:**
*Logic Error: Table output is generated by raw string joining, so any field containing the chosen separator or row separator will corrupt the table structure (shifted columns or broken rows). Use a proper CSV/TSV writer with escaping/quoting instead of manual `join`.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| let v_end = append_empty(parent, "verse"); | ||
| v_end.set_attr("eid", eid.as_str()); | ||
| } |
There was a problem hiding this comment.
Suggestion: Verse end milestones are closed on the current parent instead of the paragraph/row that actually owned the previous verse content. When a verse boundary crosses into a new paragraph or table row, this will attach the eid to the wrong node and produce incorrect USX verse structure. Track and use the previous verse container (like the Python prev_verse_parent) when emitting the previous eid. [logic error]
Severity Level: Critical 🚨
- ❌ USFM-to-USX conversion misplaces verse closing milestones.
- ⚠️ Round-trip conversions can desynchronize verse boundaries.Steps of Reproduction ✅
1. Construct a `USXGenerator` with USFM input where a verse ends in one paragraph and the
next `\v` marker starts a new paragraph (e.g., `\p \v 1 ...` followed by `\p \v 2 ...` in
the same chapter) and obtain a tree-sitter AST root node.
2. Call `USXGenerator::get_usx` at `rust-usfm-parser/src/usx_generator.rs:142`, which
dispatches to `node_2_usx_chapter` for the chapter node at
`rust-usfm-parser/src/usx_generator.rs:293`.
3. Inside `node_2_usx_chapter` (lines 293–301), the first `paragraph` child is handled by
`node_2_usx` with `parent` pointing to the USX root; `node_2_usx_para` (lines 460–514)
creates the first `<para>` and, when it encounters `v 1`, calls `node_2_usx_verse` (lines
376–433) with `parent` pointing to that first `<para>`, setting `prev_verse_sid_to_close =
Some("BOOK 1:1")`.
4. When the second `paragraph` (containing `v 2`) is processed, `node_2_usx_para` creates
a second `<para>`; its `v 2` node again calls `node_2_usx_verse`, but at entry the
`parent` now points to the second `<para>`. The code at lines 379–381 appends `<verse
eid="BOOK 1:1"/>` to the second `<para>`, so the end milestone for verse 1 is attached to
the wrong container instead of the paragraph or row that actually contained verse 1's
content.Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** rust-usfm-parser/src/usx_generator.rs
**Line:** 379:381
**Comment:**
*Logic Error: Verse end milestones are closed on the current `parent` instead of the paragraph/row that actually owned the previous verse content. When a verse boundary crosses into a new paragraph or table row, this will attach the `eid` to the wrong node and produce incorrect USX verse structure. Track and use the previous verse container (like the Python `prev_verse_parent`) when emitting the previous `eid`.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| // #[test] | ||
| fn test_compare_usx_with_testsuite_samples() { |
There was a problem hiding this comment.
Suggestion: The main test that compares generated USX with testsuite reference XML is not annotated as a test, so it never runs. This silently removes regression coverage for exact USX output differences. Re-enable the test annotation so comparison failures are caught in CI. [code quality]
Severity Level: Critical 🚨
- ❌ `cargo test` omits USX-vs-origin.xml regression coverage.
- ⚠️ Structural USX changes may ship without detection.
- ⚠️ Differences in normalized whitespace may go unnoticed.
- ⚠️ Bible testsuite origin.xml alignment not enforced by Rust tests.Steps of Reproduction ✅
1. Open `rust-usfm-parser/tests/test_usx_conversion.rs` and locate the section `//
test_compare_usx_with_testsuite_samples` starting at line 146.
2. Observe at lines 213–214 that the test attribute is commented out: `// #[test]`
followed by `fn test_compare_usx_with_testsuite_samples() {`, meaning this function is a
plain helper and not a Rust test.
3. Search for usages with `rg "test_compare_usx_with_testsuite_samples" -n
/workspace/usfm-grammar` and confirm that the only occurrence is this definition; no other
test or module calls it, so it is never executed during `cargo test`.
4. Run the Rust test suite for this crate (e.g., `cd rust-usfm-parser && cargo test`) and
note that only `test_successful_usx_conversion` (lines 73–89),
`test_all_markers_are_in_usx_output` (lines 95–143), and `test_usx_round_tripping` (lines
294–339) execute; there is no output or failure related to exact USX-vs-origin.xml
comparison even though `test_compare_usx_with_testsuite_samples` contains that logic at
lines 215–287, so regressions in exact USX serialization can pass unnoticed.Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** rust-usfm-parser/tests/test_usx_conversion.rs
**Line:** 213:214
**Comment:**
*Code Quality: The main test that compares generated USX with testsuite reference XML is not annotated as a test, so it never runs. This silently removes regression coverage for exact USX output differences. Re-enable the test annotation so comparison failures are caught in CI.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| if !parser.errors.is_empty() { | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Suggestion: This test silently skips files that fail parsing instead of failing the test, which can hide real regressions in marker preservation. Treat parse errors as failures here so a valid fixture that starts failing does not get ignored. [logic error]
Severity Level: Major ⚠️
- ⚠️ Marker coverage test may pass while fixture parsing fails.
- ⚠️ Parser regressions may slip through CI unnoticed.Steps of Reproduction ✅
1. Run `cargo test test_usj_all_markers_are_in_output` in `rust-usfm-parser/` so the test
at `rust-usfm-parser/tests/test_json_conversion.rs:214-246` executes.
2. Inside this test, the loop over `positive_files()` (constructed from `negative_tests()`
in `rust-usfm-parser/tests/common.rs:126-131`) initializes a parser for each `origin.usfm`
via `initialise_parser` at `rust-usfm-parser/tests/common.rs:176-181`.
3. For any supposedly "positive" fixture where `USFMParser::new` produces errors,
`parser.errors` is non-empty and the guard at `test_json_conversion.rs:221-223` (`if
!parser.errors.is_empty() { continue; }`) skips that file without pushing anything into
the local `failures` vector.
4. The final assertion at `test_json_conversion.rs:245` only checks `failures.is_empty()`,
so the test still passes even though at least one "valid" USFM fixture failed to parse and
its marker preservation was never validated (unlike the Python original at
`py-usfm-parser/tests/test_json_conversion.py:12-21`, which asserts on
`test_parser.errors`).Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** rust-usfm-parser/tests/test_json_conversion.rs
**Line:** 221:223
**Comment:**
*Logic Error: This test silently skips files that fail parsing instead of failing the test, which can hide real regressions in marker preservation. Treat parse errors as failures here so a valid fixture that starts failing does not get ignored.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| if !parser1.errors.is_empty() { | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Suggestion: In the round-trip test, parse errors in the source file are ignored with continue, so the test can falsely pass without exercising USJ→USFM round-tripping for affected fixtures. Record these as failures instead of skipping. [logic error]
Severity Level: Critical 🚨
- ⚠️ Round-trip conversion regressions may not be detected.
- ⚠️ USJ→USFM pipeline untested for some valid fixtures.Steps of Reproduction ✅
1. Run `cargo test test_usj_round_tripping` so the round-trip test at
`rust-usfm-parser/tests/test_json_conversion.rs:255-291` executes.
2. For each file in `positive_files()` (from `tests/common.rs:126-131`), the test calls
`initialise_parser(&path)` at `tests/common.rs:176-181`, creating `parser1` with any parse
diagnostics stored in `parser1.errors`.
3. If a "positive" USFM fixture now produces parse errors (e.g. due to a regression in
`USFMParser::new`), the guard at `test_json_conversion.rs:260-262` (`if
!parser1.errors.is_empty() { continue; }`) skips that fixture entirely, so neither
`parser1.to_usj` (lines 264-269) nor `USFMParser::from_usj` and the re-parse at lines
272-282 are exercised.
4. Because no entry is added to the local `failures` vector for such skipped fixtures, the
final `assert!(failures.is_empty())` at `test_json_conversion.rs:291` still passes,
meaning USFM→USJ→USFM round-tripping is not validated for the failing files (in contrast
to the Python test at `py-usfm-parser/tests/test_json_conversion.py:48-57`, which asserts
on `test_parser1.errors`).Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** rust-usfm-parser/tests/test_json_conversion.rs
**Line:** 260:262
**Comment:**
*Logic Error: In the round-trip test, parse errors in the source file are ignored with `continue`, so the test can falsely pass without exercising USJ→USFM round-tripping for affected fixtures. Record these as failures instead of skipping.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| if !parser.errors.is_empty() { | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Suggestion: This comparison test also skips files when parsing fails, which hides regressions and makes results unreliable. The test should fail for unexpected parser errors on positive fixtures. [logic error]
Severity Level: Major ⚠️
- ⚠️ USJ vs origin.json mismatches may go untested.
- ⚠️ Parser failures can hide broken JSON comparison logic.Steps of Reproduction ✅
1. Run `cargo test test_compare_usj_with_testsuite_samples` so the comparison test at
`rust-usfm-parser/tests/test_json_conversion.rs:333-380` executes.
2. For each path returned by `positive_files()` (from `tests/common.rs:126-131`), the test
constructs `parser` via `initialise_parser(&path)` at `tests/common.rs:176-181`,
populating `parser.errors` if the USFM fails to parse.
3. When any "positive" fixture yields non-empty `parser.errors`, the guard at
`test_json_conversion.rs:342-344` (`if !parser.errors.is_empty() { continue; }`) skips
that fixture, so no USJ is generated and no comparison to `origin.json` is performed for
that file.
4. Since skipped files do not add entries to the `failures` vector, the final assertion at
`test_json_conversion.rs:379` (`assert!(failures.is_empty(), ...)`) can still pass,
meaning the test suite may report success while some "valid" fixtures both fail to parse
and never have their USJ compared to the reference JSON (whereas the Python test at
`py-usfm-parser/tests/test_json_conversion.py:198-215` asserts on `test_parser.errors`
before comparison).Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** rust-usfm-parser/tests/test_json_conversion.rs
**Line:** 342:344
**Comment:**
*Logic Error: This comparison test also skips files when parsing fails, which hides regressions and makes results unreliable. The test should fail for unexpected parser errors on positive fixtures.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| let mut origin_usj: Value = match fs::read_to_string(&usj_path) { | ||
| Ok(text) => match serde_json::from_str(&text) { | ||
| Ok(v) => v, | ||
| Err(_) => continue, | ||
| }, | ||
| Err(_) => continue, // no reference file — skip | ||
| }; |
There was a problem hiding this comment.
Suggestion: Errors while reading or parsing origin.json are swallowed and treated as skips, so missing/corrupt reference fixtures are never reported and the test can pass without performing the intended comparison. Only skip missing files if that is intended; malformed JSON should fail the test. [logic error]
Severity Level: Major ⚠️
- ⚠️ Corrupted origin.json fixtures are silently ignored in tests.
- ⚠️ USJ comparison coverage weaker than Python test suite.Steps of Reproduction ✅
1. Run `cargo test test_compare_usj_with_testsuite_samples` to execute the test at
`rust-usfm-parser/tests/test_json_conversion.rs:333-380`.
2. For each "positive" fixture path, the test computes `usj_path =
path.with_file_name("origin.json")` at line 352 and then executes the nested `match` at
lines 353-359, calling `fs::read_to_string(&usj_path)` and `serde_json::from_str(&text)`.
3. If the `origin.json` file exists but is malformed JSON (e.g. truncated or hand-edited
incorrectly), `serde_json::from_str` returns `Err`, and the inner arm at
`test_json_conversion.rs:354-357` executes `Err(_) => continue`, causing the test to
silently skip that fixture instead of recording a failure in the `failures` vector.
4. Because missing files are also skipped via the outer `Err(_) => continue` at line 358
(matching the Python test's `FileNotFoundError` handling at
`py-usfm-parser/tests/test_json_conversion.py:208-217`), the final
`assert!(failures.is_empty())` at `test_json_conversion.rs:379` can still succeed even
when a reference `origin.json` is present but invalid and never actually compared.Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** rust-usfm-parser/tests/test_json_conversion.rs
**Line:** 353:359
**Comment:**
*Logic Error: Errors while reading or parsing `origin.json` are swallowed and treated as skips, so missing/corrupt reference fixtures are never reported and the test can pass without performing the intended comparison. Only skip missing files if that is intended; malformed JSON should fail the test.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| let val_str = val.as_str().unwrap_or(""); | ||
| if key == "file" { | ||
| self.usfm_string.push_str(&format!("src=\"{}\" ", val_str)); | ||
| } else { | ||
| self.usfm_string.push_str(&format!("{}=\"{}\" ", key, val_str)); | ||
| } |
There was a problem hiding this comment.
Suggestion: Attribute values are written into quoted USFM attributes without escaping embedded quotes, so a value containing " will break attribute syntax and can corrupt the regenerated USFM. Sanitize or escape quotes in USJ attribute values before interpolation, consistently with the USX conversion path. [possible bug]
Severity Level: Major ⚠️
- ❌ CLI USJ→USX fails when attributes contain double quotes.
- ⚠️ USJ→USFM round-trip produces syntactically malformed USFM.
- ⚠️ Extra attributes after quote lost or misparsed on parse.Steps of Reproduction ✅
1. Prepare a USJ JSON file for the CLI input, using the format consumed in
`rust-usfm-parser/src/main.rs:96-18` when `is_json` is true (extension `.json` or `.usj`).
Include at least one object with an extra attribute whose value contains a double quote,
e.g.:
```json
{
"type": "char",
"marker": "x",
"content": [],
"custom": "foo\"bar"
}
-
Run the rust-usfm-parser CLI (binary with
mainin
rust-usfm-parser/src/main.rs:35-53) with this file asinfile,--in_format=usj(or
.jsonextension),--out_format=usxand without--ignore_errors, sois_jsonis
true atmain.rs:14-18andUSFMParser::from_usj(&usj)is called atmain.rs:15-17. -
Inside
USFMParser::from_usj(rust-usfm-parser/src/usfm_parser.rs:10-17), the code
constructs aUSFMGeneratorand callsusj_to_usfm(usj, false)
(usfm_generator.rs:100-104). For the USJ object with"custom": "foo"bar", the "custom" key is not inNON_ATTRIB_USJ_KEYS(usfm_generator.rs:15-18), so the "Extra attributes" block atusfm_generator.rs:178-195runs:val_strbecomes"foo"bar"(line 187), andself.usfm_string.push_str(&format!("custom="{}" ", val_str));atline 191-192producescustom="foo"bar "`. This string has an embedded double quote that prematurely
closes the attribute value, making the generated USFM syntactically malformed. -
Still inside
from_usj, the malformed USFM is passed toUSFMParser::build
(usfm_parser.rs:45-66), which parses it with tree-sitter and collects ERROR nodes into
self.errors(usfm_parser.rs:67-79). When the CLI later calls
parser.to_usx(ignore_errors)atrust-usfm-parser/src/main.rs:19-21with
ignore_errorsfalse by default (main.rs:201-205,225),guard_errorsat
usfm_parser.rs:42-48sees non-emptyself.errors, returns an error, and the CLI prints
the error and exits with code 1 atmain.rs:21-29. Thus, any USJ input containing an
extra attribute value with a double quote reliably breaks USJ→USFM→USX conversion.
</details>
[Fix in Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt=This%20is%20a%20comment%20left%20during%20a%20code%20review.%0A%0A%2A%2APath%3A%2A%2A%20rust-usfm-parser%2Fsrc%2Fusfm_generator.rs%0A%2A%2ALine%3A%2A%2A%20187%3A192%0A%2A%2AComment%3A%2A%2A%0A%09%2APossible%20Bug%3A%20Attribute%20values%20are%20written%20into%20quoted%20USFM%20attributes%20without%20escaping%20embedded%20quotes%2C%20so%20a%20value%20containing%20%60%22%60%20will%20break%20attribute%20syntax%20and%20can%20corrupt%20the%20regenerated%20USFM.%20Sanitize%20or%20escape%20quotes%20in%20USJ%20attribute%20values%20before%20interpolation%2C%20consistently%20with%20the%20USX%20conversion%20path.%0A%0AValidate%20the%20correctness%20of%20the%20flagged%20issue.%20If%20correct%2C%20How%20can%20I%20resolve%20this%3F%20If%20you%20propose%20a%20fix%2C%20implement%20it%20and%20please%20make%20it%20concise.%0AOnce%20fix%20is%20implemented%2C%20also%20check%20other%20comments%20on%20the%20same%20PR%2C%20and%20ask%20user%20if%20the%20user%20wants%20to%20fix%20the%20rest%20of%20the%20comments%20as%20well.%20if%20said%20yes%2C%20then%20fetch%20all%20the%20comments%20validate%20the%20correctness%20and%20implement%20a%20minimal%20fix%0A) | [Fix in VSCode Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt=This%20is%20a%20comment%20left%20during%20a%20code%20review.%0A%0A%2A%2APath%3A%2A%2A%20rust-usfm-parser%2Fsrc%2Fusfm_generator.rs%0A%2A%2ALine%3A%2A%2A%20187%3A192%0A%2A%2AComment%3A%2A%2A%0A%09%2APossible%20Bug%3A%20Attribute%20values%20are%20written%20into%20quoted%20USFM%20attributes%20without%20escaping%20embedded%20quotes%2C%20so%20a%20value%20containing%20%60%22%60%20will%20break%20attribute%20syntax%20and%20can%20corrupt%20the%20regenerated%20USFM.%20Sanitize%20or%20escape%20quotes%20in%20USJ%20attribute%20values%20before%20interpolation%2C%20consistently%20with%20the%20USX%20conversion%20path.%0A%0AValidate%20the%20correctness%20of%20the%20flagged%20issue.%20If%20correct%2C%20How%20can%20I%20resolve%20this%3F%20If%20you%20propose%20a%20fix%2C%20implement%20it%20and%20please%20make%20it%20concise.%0AOnce%20fix%20is%20implemented%2C%20also%20check%20other%20comments%20on%20the%20same%20PR%2C%20and%20ask%20user%20if%20the%20user%20wants%20to%20fix%20the%20rest%20of%20the%20comments%20as%20well.%20if%20said%20yes%2C%20then%20fetch%20all%20the%20comments%20validate%20the%20correctness%20and%20implement%20a%20minimal%20fix%0A)
*(Use Cmd/Ctrl + Click for best experience)*
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** rust-usfm-parser/src/usfm_generator.rs
**Line:** 187:192
**Comment:**
*Possible Bug: Attribute values are written into quoted USFM attributes without escaping embedded quotes, so a value containing `"` will break attribute syntax and can corrupt the regenerated USFM. Sanitize or escape quotes in USJ attribute values before interpolation, consistently with the USX conversion path.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
| let sid = format!( | ||
| "{} {}:{}", | ||
| self.parse_state.book_slug.as_deref().unwrap_or(""), | ||
| self.parse_state.current_chapter.as_deref().unwrap_or(""), | ||
| num | ||
| ); | ||
|
|
||
| let mut v_obj = json!({ | ||
| "type": "verse", | ||
| "marker": "v", | ||
| "number": num, | ||
| "sid": sid, |
There was a problem hiding this comment.
Suggestion: The verse reference id is always constructed and emitted even when book or chapter state is missing, which produces malformed references like empty-book/empty-chapter SIDs on partial or error-tolerant parses. Only set a SID when both required state values are available, otherwise omit it. [logic error]
Severity Level: Major ⚠️
- ⚠️ USJ from to_usj() can contain malformed verse sid fields.
- ⚠️ BibleNLP export via to_biblenlp_format() inherits invalid SIDs.
- ⚠️ Downstream tools consuming sid cannot rely on verse references.Steps of Reproduction ✅
1. Prepare malformed or partial USFM that includes a `\v` verse marker without a preceding
`\id` book marker or `\c` chapter marker, so the tree-sitter parse cannot populate
book/chapter state (grammar for `id` and `chapter` is in
`tree-sitter-usfm3/grammar.js:45-48` and `146-154`).
2. In Rust, call `USFMParser::to_usj(exclude_markers, include_markers, ignore_errors =
true, combine_texts)` from `rust-usfm-parser/src/usfm_parser.rs:26-48` so that
`guard_errors()` allows generation despite parse errors caused by the missing markers.
3. `to_usj()` builds a `USJGenerator` and calls `get_usj(self.syntax_tree.root_node())`;
within `get_usj`, `node_2_usj()` (`rust-usfm-parser/src/usj_generator.rs:382-447`)
dispatches each `v` node to `node_2_usj_verse()` at
`rust-usfm-parser/src/usj_generator.rs:270-330`.
4. In `node_2_usj_verse()`, the `sid` is constructed at `usj_generator.rs:308-313` using
`self.parse_state.book_slug.as_deref().unwrap_or("")` and
`self.parse_state.current_chapter.as_deref().unwrap_or("")`, both still `None` because no
`id` or `c` node was processed, so the resulting USJ verse objects contain malformed
identifiers like `"sid": " :5"` with empty book and chapter components.Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** rust-usfm-parser/src/usj_generator.rs
**Line:** 308:319
**Comment:**
*Logic Error: The verse reference id is always constructed and emitted even when book or chapter state is missing, which produces malformed references like empty-book/empty-chapter SIDs on partial or error-tolerant parses. Only set a SID when both required state values are available, otherwise omit it.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| if nestable { | ||
| self.node_2_usj(child, &mut para_obj); | ||
| } else { | ||
| if !node_added { | ||
| Self::push_obj(parent, para_obj.clone()); | ||
| node_added = true; | ||
| } | ||
| self.node_2_usj(child, parent); | ||
| } |
There was a problem hiding this comment.
Suggestion: The paragraph object is pushed as a clone the first time a non-nestable child is seen, and later nestable children are still appended to the local para_obj that is no longer connected to the output tree. This can silently drop trailing inline content from the generated USJ. Push the paragraph object once up front (or mutate the already-pushed node) so all nestable children are preserved. [logic error]
Severity Level: Critical 🚨
- ❌ USJ output from to_usj() drops text after figures.
- ❌ List output from to_list() omits trailing heading content.
- ⚠️ BibleNLP export via to_biblenlp_format() may miss title text.Steps of Reproduction ✅
1. Create a USFM file that uses a generic paragraph-style marker such as `\cd` with inline
text, an embedded figure, and trailing text, for example: `\cd First part \fig ...\fig*
second part`. The `cd` rule allowing `fig` is defined in
`tree-sitter-usfm3/grammar.js:180-186`.
2. In Rust, construct a parser and call `USFMParser::to_usj(..)` from
`rust-usfm-parser/src/usfm_parser.rs:26-48` on this USFM, so it builds a syntax tree and
then invokes `USJGenerator::get_usj(self.syntax_tree.root_node())`.
3. Inside `USJGenerator::node_2_usj()` at `rust-usfm-parser/src/usj_generator.rs:382-447`,
the `cd` node is dispatched to `node_2_usj_generic()`
(`rust-usfm-parser/src/usj_generator.rs:306-352`), which creates a local `para_obj` and
then iterates its children.
4. In `node_2_usj_generic()`, the first `text` child is treated as `nestable` and appended
to `para_obj`, the `fig` child is treated as non-nestable and causes `para_obj.clone()` to
be pushed to `parent` at `usj_generator.rs:738-743`, and any subsequent `text` children
(the "second part") are again treated as `nestable` and appended only to the local
`para_obj` that is no longer pushed. Inspecting the `to_usj()` JSON output shows that text
after the `\fig` (e.g. "second part") is missing from the generated USJ.Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** rust-usfm-parser/src/usj_generator.rs
**Line:** 738:746
**Comment:**
*Logic Error: The paragraph object is pushed as a clone the first time a non-nestable child is seen, and later nestable children are still appended to the local `para_obj` that is no longer connected to the output tree. This can silently drop trailing inline content from the generated USJ. Push the paragraph object once up front (or mutate the already-pushed node) so all nestable children are preserved.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix|
CodeAnt AI finished running the review. Thanks for using CodeAnt! 🎉We're free for open-source projects. if you're enjoying it, help us grow by sharing. Share on X · |
User description