Skip to content

Rust implementation#392

Open
kavitharaju wants to merge 71 commits intomasterfrom
rust
Open

Rust implementation#392
kavitharaju wants to merge 71 commits intomasterfrom
rust

Conversation

@kavitharaju
Copy link
Copy Markdown
Collaborator

@kavitharaju kavitharaju commented Apr 28, 2026

User description

Kali-05 and others added 30 commits January 29, 2025 20:30
* Parsing and Partial Validation done

* commented changes are executed

* Add link to demo in readme

* corrected the error collection

* draft validator

* updates .gitignore

* updates .gitignore

* updated .gitignore

* updated validation

* target folder removed

---------

Co-authored-by: kavitharaju <kavitharaju18@gmail.com>
Co-authored-by: Joel Mathew <mail@joelmath.io>
* On branch rust
	new file:   rust-usfm-parser/Cargo.toml
	new file:   rust-usfm-parser/src/main.rs

* rust project init

* rust project folder init

* rust floder init

* added README.md

* read me

* readme

* readme

* rust init

* rust init

* rust init

* node to json

* Add link to demo in readme

* usjGenerator

* all markers added

* all markers added.

* more type support added

* more type support added

* Rename main.rs to usjGenerator.rs

* Delete rust-usfm-parser/src/usj_generator.rs

* Delete rust-usfm-parser/src/usj_generator.js

* Delete rust-usfm-parser/src/main

* usj_generator added
	modified:   rust-usfm-parser/Cargo.lock
	modified:   rust-usfm-parser/Cargo.toml
	renamed:    rust-usfm-parser/src/input.usfm -> rust-usfm-parser/input.usfm
	modified:   rust-usfm-parser/src/main.rs
	new file:   rust-usfm-parser/src/usj_generator.rs

* new file:   rust-usfm-parser/src/globals.rs
	modified:   rust-usfm-parser/src/main.rs
	modified:   rust-usfm-parser/src/parser.rs
	modified:   rust-usfm-parser/src/usj_generator.rs

* Delete rust-usfm-parser/target directory

* Update usjGenerator.rs

* Delete .idea directory

* Delete rust-usfm-parser/target directory

* Delete .idea directory

* Update package.json

* Update usj_generator.py

* Delete rust-usfm-parser/src/usjGenerator.rs

* Update usj_generator.rs

* Update validator.rs

* Update schema.rs

* Update parser.rs

* Update main.rs

* Update globals.rs

* Delete rust-usfm-parser/captures.rs

* Delete rust-usfm-parser/captures.rs

* fixed some bugs
	modified:   rust-usfm-parser/input.usfm
	modified:   rust-usfm-parser/src/usj_generator.rs

* Delete node-usfm-parser/package-lock.json

* Update input.usfm

---------

Co-authored-by: kavitharaju <kavitharaju18@gmail.com>
Co-authored-by: Joel Mathew <mail@joelmath.io>
* Add link to demo in readme

* lib added
	new file:   rust-usfm-parser/src/lib.rs

* lib edit
	modified:   rust-usfm-parser/src/lib.rs

* test cases added ,filter addded to parser.rs

* Test suit verified

* added seperate files for test utils and test_usj functions

* added seperate files for test_utils and test_usj functions

---------

Co-authored-by: kavitharaju <kavitharaju18@gmail.com>
Co-authored-by: Joel Mathew <mail@joelmath.io>
* On branch rust
	new file:   rust-usfm-parser/Cargo.toml
	new file:   rust-usfm-parser/src/main.rs

* rust project init

* rust project folder init

* rust floder init

* added README.md

* read me

* readme

* readme

* rust init

* rust init

* rust init

* node to json

* Add link to demo in readme

* usjGenerator

* all markers added

* all markers added.

* more type support added

* more type support added

* Rename main.rs to usjGenerator.rs

* Delete rust-usfm-parser/src/usj_generator.rs

* Delete rust-usfm-parser/src/usj_generator.js

* Delete rust-usfm-parser/src/main

* usj_generator added
	modified:   rust-usfm-parser/Cargo.lock
	modified:   rust-usfm-parser/Cargo.toml
	renamed:    rust-usfm-parser/src/input.usfm -> rust-usfm-parser/input.usfm
	modified:   rust-usfm-parser/src/main.rs
	new file:   rust-usfm-parser/src/usj_generator.rs

* new file:   rust-usfm-parser/src/globals.rs
	modified:   rust-usfm-parser/src/main.rs
	modified:   rust-usfm-parser/src/parser.rs
	modified:   rust-usfm-parser/src/usj_generator.rs

* Delete rust-usfm-parser/target directory

* Update usjGenerator.rs

* Delete .idea directory

* Delete rust-usfm-parser/target directory

* Delete .idea directory

* Update package.json

* Update usj_generator.py

* Delete rust-usfm-parser/src/usjGenerator.rs

* Update usj_generator.rs

* Update validator.rs

* Update schema.rs

* Update parser.rs

* Update main.rs

* Update globals.rs

* Delete rust-usfm-parser/captures.rs

* Delete rust-usfm-parser/captures.rs

* fixed some bugs
	modified:   rust-usfm-parser/input.usfm
	modified:   rust-usfm-parser/src/usj_generator.rs

* Delete node-usfm-parser/package-lock.json

* Update input.usfm

* lib added
	new file:   rust-usfm-parser/src/lib.rs

* lib edit
	modified:   rust-usfm-parser/src/lib.rs

* test cases added ,filter addded to parser.rs

* Test suit verified

* added seperate files for test utils and test_usj functions

* added seperate files for test_utils and test_usj functions

* format of the errors have  updated for test functions

* format of the errors have  updated for test functions

---------

Co-authored-by: amal-J-k <22l024@gecskp.ac.in>
Co-authored-by: amal-J-k <143809111+amal-J-k@users.noreply.github.com>
Co-authored-by: kavitharaju <kavitharaju18@gmail.com>
Co-authored-by: Joel Mathew <mail@joelmath.io>
* latest usj_generator
	modified:   rust-usfm-parser/src/main.rs
	modified:   rust-usfm-parser/src/parser.rs
	modified:   rust-usfm-parser/src/usj_generator.rs
	modified:   rust-usfm-parser/tests/test_usj.rs

* cahnged node2_usj
	modified:   rust-usfm-parser/src/main.rs
	modified:   rust-usfm-parser/src/usj_generator.rs

* usj_generator changed
	modified:   rust-usfm-parser/src/usj_generator.rs

* small changes in node_2_usj
	modified:   rust-usfm-parser/src/usj_generator.rs

* debug usj_generator.rs
	modified:   rust-usfm-parser/src/usj_generator.rs

* major bug that caused repetition of fixed
	modified:   rust-usfm-parser/src/usj_generator.rs
* latest usj_generator
	modified:   rust-usfm-parser/src/main.rs
	modified:   rust-usfm-parser/src/parser.rs
	modified:   rust-usfm-parser/src/usj_generator.rs
	modified:   rust-usfm-parser/tests/test_usj.rs

* cahnged node2_usj
	modified:   rust-usfm-parser/src/main.rs
	modified:   rust-usfm-parser/src/usj_generator.rs

* usj_generator changed
	modified:   rust-usfm-parser/src/usj_generator.rs

* small changes in node_2_usj
	modified:   rust-usfm-parser/src/usj_generator.rs

* debug usj_generator.rs
	modified:   rust-usfm-parser/src/usj_generator.rs

* major bug that caused repetition of fixed
	modified:   rust-usfm-parser/src/usj_generator.rs

* Update usj_generator.rs
* threading

* implimented threading in the new code

* updated gitignore

* implimented threading in the new code
* Update grammar to support char in rem

* Bump version: 3.0.0 → 3.0.1-alpha.1

* Add new test case to check char in rem support
* Fix issue of missing contents when paras are excluded

* Always include BCV when converting to List

* Copy changes in node to web also

* Test and fix

* Bump version: 3.0.1-alpha.1 → 3.0.1-alpha.2

* Include link to issue in \rem test case
* Implement biblenlp to usfm conversion

* Add a test case for BibleNlp to USFM conversion

* Include BibleNLP's default vref in code base

* Implement CLI for in_format biblenlp

* Add usage instruction for from_biblenlp option in README

* Fix some linting issues

* Fix more linting issues

* Copy biblenlp to usfm conversion logic to node

* Add tests in node

* Minor test fix

* Copy the conversion logic to web module

* Tests for web module

* Documentation for BibleNlp input usage in web and node modules

* Bump version: 3.0.1-alpha.2 → 3.0.1-alpha.3

* Correct Readmes

* Correct Readmes

* Add links to inner readme files from the outer readme

* Allow optional bookcode arg when using full vref and/or texts

* Allow optional bookcode arg when using full vref in Node

* Allow optional bookcode arg when using full vref in Web module

* Test and fix node and web

* Allow text list to be of length 23213, 31170 or 41899

* Handle the changes in python CLI

* Ignore pylint warning

* Ignore pylint warning

* Ignore pylint warning
Bumps [tar-fs](https://github.com/mafintosh/tar-fs) from 2.1.1 to 2.1.2.
- [Commits](mafintosh/tar-fs@v2.1.1...v2.1.2)

---
updated-dependencies:
- dependency-name: tar-fs
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Fix the USJ conversion when there is nestedchar in footnote. In python

* Add a test case with char in footnote with and without nesting

* Fix the USX conversion in python

* Fix the issue in node and web

* Update grammar to allow empty \ft and other notes elements

* update the usj output in testsuite
* Output version in cli

* Bump version: 3.0.1-alpha.3 → 3.0.1-alpha.4

* Validate the usfm starts with \ before parsing

* Exempt empty file tests

* Change to_list() to generate reference list and change how include exclude works

* Implement toList() changes in node

* Implement toList() changes in web

* Update error message in test for usfm sanity check

* Bump version: 3.0.1-alpha.4 → 3.1.0

* Handle linting issues
Upgrades the versions of github actions due to the deprecation.
- upload-artifact@v3 to upload-artifacts@v4
- download-artifacts@v3 to download-artifacts@v4
* fix the special handling of c and v not reachable

* List xmldom as a peerdependency

* Bump version: 3.1.0 → 3.1.1
* Update grammar to allow v and empty string in table cell

* Bump version: 3.1.1 → 3.1.2

* Add test for empty table cell and v in table

* In USX gen, to add verse end, find the previous rows last cell when in for cell of current row

* save state of last versetext parent and sid in python

* Implement the fix in node and web

* Minor change
Bumps [tar-fs](https://github.com/mafintosh/tar-fs) from 2.1.2 to 2.1.3.
- [Commits](https://github.com/mafintosh/tar-fs/commits)

---
updated-dependencies:
- dependency-name: tar-fs
  dependency-version: 2.1.3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Update grammar to support \q4

* Change parse tree outputs in grammar tests

* Add a test case for \q4

* Let \liv, \th, \thr, \tc, \tcr,
etc have unlimited numbering

* Let \h have unrestricted numbering
* Add grammar rules for \tcc

* Update python methods to include \tcc

* Add testcase for \tcc

* Update web module to support \tcc

* Update node module to support \tcc

* Fix linting issues

* Fix typo
* Update testpypi_publish.yml

* Update testpypi_publish.yml

* Update pypi_publish.yml

* Update pypi_publish.yml

* Update pypi_publish.yml

* Update pypi_publish.yml

* Update pypi_publish.yml
* Allow multiple \sr and \r markers under one \s#

* Bump version: 3.1.2 → 3.1.3
Bumps [deepdiff](https://github.com/seperman/deepdiff) from 7.0.1 to 8.6.1.
- [Release notes](https://github.com/seperman/deepdiff/releases)
- [Changelog](https://github.com/seperman/deepdiff/blob/master/docs/changelog.rst)
- [Commits](https://github.com/seperman/deepdiff/commits)

---
updated-dependencies:
- dependency-name: deepdiff
  dependency-version: 8.6.1
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [jupyterlab](https://github.com/jupyterlab/jupyterlab) from 3.6.8 to 4.4.8.
- [Release notes](https://github.com/jupyterlab/jupyterlab/releases)
- [Changelog](https://github.com/jupyterlab/jupyterlab/blob/main/RELEASE.md)
- [Commits](https://github.com/jupyterlab/jupyterlab/compare/@jupyterlab/vdom@3.6.8...@jupyterlab/lsp@4.4.8)

---
updated-dependencies:
- dependency-name: jupyterlab
  dependency-version: 4.4.8
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [tar-fs](https://github.com/mafintosh/tar-fs) from 2.1.3 to 2.1.4.
- [Commits](mafintosh/tar-fs@v2.1.3...v2.1.4)

---
updated-dependencies:
- dependency-name: tar-fs
  dependency-version: 2.1.4
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
@codeant-ai
Copy link
Copy Markdown

codeant-ai Bot commented Apr 28, 2026

CodeAnt AI is reviewing your PR.


Thanks for using CodeAnt! 🎉

We're free for open-source projects. if you're enjoying it, help us grow by sharing.

Share on X ·
Reddit ·
LinkedIn

@codeant-ai codeant-ai Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files label Apr 28, 2026
@codeant-ai
Copy link
Copy Markdown

codeant-ai Bot commented Apr 28, 2026

CodeAnt AI is running the review.


Thanks for using CodeAnt! 🎉

We're free for open-source projects. if you're enjoying it, help us grow by sharing.

Share on X ·
Reddit ·
LinkedIn

@codeant-ai codeant-ai Bot added size:XXL This PR changes 1000+ lines, ignoring generated files and removed size:XXL This PR changes 1000+ lines, ignoring generated files labels Apr 28, 2026
@codeant-ai
Copy link
Copy Markdown

codeant-ai Bot commented Apr 28, 2026

Sequence Diagram

This diagram shows how the new Rust USFMParser accepts multiple input formats, builds a syntax tree with tree sitter, and converts the content into USJ, USX, table, USFM, or BibleNLP formats with optional marker filtering.

sequenceDiagram
    participant Client
    participant USFMParser
    participant TreeSitter
    participant Converters

    Client->>USFMParser: Create from USFM USJ USX or BibleNLP data
    USFMParser->>TreeSitter: Parse USFM string into syntax tree and collect errors
    TreeSitter-->>USFMParser: Syntax tree plus errors warnings
    Client->>USFMParser: Request output in chosen format
    USFMParser->>USFMParser: Apply marker filters and combine adjacent text
    USFMParser->>Converters: Convert filtered USJ to USJ USX table USFM or BibleNLP
    Converters-->>USFMParser: Target format data
    USFMParser-->>Client: Return converted output or write files
Loading

Generated by CodeAnt AI

@codeant-ai
Copy link
Copy Markdown

codeant-ai Bot commented Apr 28, 2026

CodeAnt AI finished reviewing your PR.

Comment on lines +429 to +437
let (vrefs, texts): (Vec<&str>, Vec<&str>) = if let Some(ref bc) = book_code_upper {
let pairs: Vec<(&str, &str)> = input
.vref
.iter()
.zip(input.text.iter())
.filter(|(r, _)| r.trim().to_uppercase().starts_with(bc.as_str()))
.map(|(r, t)| (r.as_str(), t.as_str()))
.collect();
pairs.iter().map(|(r, t)| (*r, *t)).unzip()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Architect Review — HIGH

In biblenlp_to_usfm, when a book_code is provided, vref and text are zipped and filtered before the length check, so mismatched input lists can silently drop extra verses instead of raising the intended length/versification error.

Suggestion: Mirror the Python logic by validating the original vref/text lengths before zipping, and when filtering by book_code, filter both lists in lockstep and then enforce a strict length equality check that raises a ParsingError on mismatch instead of truncating.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is an **Architect / Logical Review** comment left during a code review. These reviews are first-class, important findings — not optional suggestions. Do NOT dismiss this as a 'big architectural change' just because the title says architect review; most of these can be resolved with a small, localized fix once the intent is understood.

**Path:** rust-usfm-parser/src/usfm_generator.rs
**Line:** 429:437
**Comment:**
	*HIGH: In biblenlp_to_usfm, when a book_code is provided, vref and text are zipped and filtered before the length check, so mismatched input lists can silently drop extra verses instead of raising the intended length/versification error.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
If a suggested approach is provided above, use it as the authoritative instruction. If no explicit code suggestion is given, you MUST still draft and apply your own minimal, localized fix — do not punt back with 'no suggestion provided, review manually'. Keep the change as small as possible: add a guard clause, gate on a loading state, reorder an await, wrap in a conditional, etc. Do not refactor surrounding code or expand scope beyond the finding.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix

Comment on lines +615 to +620
let align = if style.contains("tcc") || style.contains("thc") {
"center"
} else if style.ends_with('r') {
"end"
} else {
"start"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Architect Review — HIGH

In USXGenerator::node_2_usx_table, table cell alignment uses style.ends_with('r'), so numbered right-aligned markers like "tcr3" and "thr3" are misclassified as "start" rather than "end", diverging from the USJGenerator alignment logic and the tests' expectations.

Suggestion: Normalize the alignment rule for table cells to match USJGenerator (e.g. detect right alignment with style.contains('r') while keeping the special tcc/thc center check) so numbered tcr*/thr* styles are correctly marked as end-aligned in USX output.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is an **Architect / Logical Review** comment left during a code review. These reviews are first-class, important findings — not optional suggestions. Do NOT dismiss this as a 'big architectural change' just because the title says architect review; most of these can be resolved with a small, localized fix once the intent is understood.

**Path:** rust-usfm-parser/src/usx_generator.rs
**Line:** 615:620
**Comment:**
	*HIGH: In USXGenerator::node_2_usx_table, table cell alignment uses style.ends_with('r'), so numbered right-aligned markers like "tcr3" and "thr3" are misclassified as "start" rather than "end", diverging from the USJGenerator alignment logic and the tests' expectations.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
If a suggested approach is provided above, use it as the authoritative instruction. If no explicit code suggestion is given, you MUST still draft and apply your own minimal, localized fix — do not punt back with 'no suggestion provided, review manually'. Keep the change as small as possible: add a guard clause, gate on a loading state, reorder an await, wrap in a conditional, etc. Do not refactor surrounding code or expand scope beyond the finding.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix

Comment on lines +16 to +19
fn punct_no_space_after() -> &'static Regex {
static RE: OnceLock<Regex> = OnceLock::new();
RE.get_or_init(|| Regex::new(r#"[,.\-—/;:!?@$%^)}\]>»]$"#).unwrap())
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The "no-space-after" punctuation regex is inverted: it currently matches trailing closing punctuation like . and ,, which suppresses spaces after sentence punctuation and can merge words incorrectly during text combination. Use the opening-punctuation set for this check (as in the Python behavior) so spaces are only suppressed where appropriate. [logic error]

Severity Level: Critical 🚨
- ❌ BibleNLP text output merges words after sentence punctuation.
- ❌ USJ output with combine_texts=true mis-spaces verse text.
- ⚠️ Table/list exports with combine_texts=true may join tokens.
Steps of Reproduction ✅
1. Build and run the Rust CLI in `rust-usfm-parser/src/main.rs` with a real USFM file that
contains sentence-ending punctuation, e.g. `tests/usfmjsTests/psa_quotes/origin.usfm`
(verified in that file at `tests/usfmjsTests/psa_quotes/origin.usfm:13-16` with text
ending in `,` and `.`).

2. Invoke BibleNLP export so that the Rust code paths are exercised by running:
`usfm-grammar --infile tests/usfmjsTests/psa_quotes/origin.usfm --out_format biblenlp
--ignore_errors` which hits the `match out_format` arm `"biblenlp"` in
`rust-usfm-parser/src/main.rs:27-48` and calls `parser.to_biblenlp_format(ignore_errors)`
at `main.rs:41`.

3. Inside `USFMParser::to_biblenlp_format` in
`rust-usfm-parser/src/usfm_parser.rs:380-401`, the parser builds a USJ tree and then calls
`include_markers_in_usj(&usj_root, &keep, true)` at `usfm_parser.rs:17-19`, passing
`combine_texts = true`, which in turn causes `include_inner` in
`rust-usfm-parser/src/filter.rs:77-146` to call `combine_consecutive_text_contents` at
`filter.rs:128-129`.

4. In `combine_consecutive_text_contents` (`rust-usfm-parser/src/filter.rs:37-67`), the
decision to insert a space between accumulated `buffer` and incoming `text` uses
`!punct_no_space_before().is_match(&text) && !punct_no_space_after().is_match(&buffer)` at
`filter.rs:47-48`. However, `punct_no_space_after` is currently defined as
`Regex::new(r#"[,.\-—/;:!?@$%^)}\]>»]$"#)` at `filter.rs:16-18`, which matches *closing*
punctuation like `.` and `,`. As a result, when a verse fragment ends with `.` (e.g. the
`mockers.` at `psa_quotes/origin.usfm:15`) and the next fragment starts with a word,
`punct_no_space_after().is_match(&buffer)` returns true and `add_space` becomes false,
producing concatenated output like `mockers.Blessed` in the generated BibleNLP text
instead of `mockers. Blessed`. Comparing to the reference JavaScript implementation in
`web-usfm-parser/src/filters.js:12-16` shows that `punctPatternNoSpaceAfter` there
intentionally matches *opening* punctuation (`/[\-—/`@^&({[<"«]$/`), so the Rust version's
use of closing punctuation is inverted and causes incorrect space suppression after
sentence punctuation.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** rust-usfm-parser/src/filter.rs
**Line:** 16:19
**Comment:**
	*Logic Error: The "no-space-after" punctuation regex is inverted: it currently matches trailing closing punctuation like `.` and `,`, which suppresses spaces after sentence punctuation and can merge words incorrectly during text combination. Use the opening-punctuation set for this check (as in the Python behavior) so spaces are only suppressed where appropriate.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

Comment on lines +117 to +119
} else if is_biblenlp {
Err("BibleNLP input is not yet supported in this CLI. \
Supply a USFM, USJ, or USX file instead.".into())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: --in_format biblenlp is advertised as a valid input format, but this branch always returns an error instead of constructing a parser from BibleNLP data. This breaks the CLI contract and makes BibleNLP input unusable. Implement BibleNLP parsing here (including reading vref data) and call USFMParser::from_biblenlp(...) instead of hard-failing. [logic error]

Severity Level: Critical 🚨
- ❌ CLI cannot import BibleNLP data via --in_format biblenlp.
- ⚠️ Rust CLI diverges from BibleNLP support promised in PR.
- ⚠️ Prevents round-tripping BibleNLP → USFM in Rust tool.
Steps of Reproduction ✅
1. Build the Rust CLI binary for this crate (e.g. `cargo build -p rust-usfm-parser`) so
that `rust-usfm-parser/src/main.rs` is compiled into an executable using `main()` at lines
130–404.

2. Run the CLI with an explicit BibleNLP input format, for example:
`target/debug/rust-usfm-parser --in_format biblenlp some_input.txt` (the `in_format` flag
is accepted because it is registered with `value_parser(["usfm", "usj", "usx",
"biblenlp"])` in `main` at lines 153–157).

3. In `main` at lines 243–249, the program calls `handle_input(infile,
Some(in_format.unwrap_or("usfm")))`, passing `"biblenlp"` as `in_format`; inside
`handle_input` (lines 92–124), `fmt_str` is computed and `is_biblenlp` is set to `true` at
line 107 when `fmt_str.as_deref() == Some("biblenlp")`.

4. The `else if is_biblenlp` branch at lines 117–119 is taken and returns `Err("BibleNLP
input is not yet supported in this CLI. Supply a USFM, USJ, or USX file
instead.".into())`, which `main` treats as a fatal error at lines 243–249
(`eprintln!("Error: {}", e); process::exit(1);`), so any `--in_format biblenlp` invocation
fails even though the flag is advertised as valid input and
`USFMParser::from_biblenlp(...)` exists in `rust-usfm-parser/src/usfm_parser.rs:20–28`.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** rust-usfm-parser/src/main.rs
**Line:** 117:119
**Comment:**
	*Logic Error: `--in_format biblenlp` is advertised as a valid input format, but this branch always returns an error instead of constructing a parser from BibleNLP data. This breaks the CLI contract and makes BibleNLP input unusable. Implement BibleNLP parsing here (including reading vref data) and call `USFMParser::from_biblenlp(...)` instead of hard-failing.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

Comment on lines +310 to +320
for row in rows {
let line = [
row.book.as_str(),
row.chapter.as_str(),
row.verse.as_str(),
row.text.as_str(),
row.r#type.as_str(),
row.marker.as_str(),
]
.join(col_sep);
write!(out, "{}{}", line, row_sep).unwrap();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Table output is generated by raw string joining, so any field containing the chosen separator or row separator will corrupt the table structure (shifted columns or broken rows). Use a proper CSV/TSV writer with escaping/quoting instead of manual join. [logic error]

Severity Level: Major ⚠️
- ⚠️ Table/CSV exports can be malformed when text has separators.
- ⚠️ Downstream spreadsheet imports may misalign verse columns.
- ⚠️ Rust CLI behavior diverges from Python csv.writer implementation.
Steps of Reproduction ✅
1. Build the Rust CLI binary for this crate (e.g. `cargo build -p rust-usfm-parser`) so
`main()` in `rust-usfm-parser/src/main.rs:130–404` is used.

2. Prepare an input that leads to a `ListRow` whose `text` includes the chosen column or
row separator; for example, create a small USFM file under `tests-data/` where a verse
text contains a literal tab character (`\t`) or newline, which `USFMParser::to_list(...)`
ultimately copies into `ListRow.text` without stripping (see
`rust-usfm-parser/src/list_generator.rs:141–169` where `text.to_string()` is assigned
directly).

3. Run the CLI with table output: `target/debug/rust-usfm-parser --out_format table
--csv_col_sep "\t" --csv_row_sep "\n" path/to/that.usfm`. In `main` at lines 297–307, the
program calls `parser.to_list(...)` and iterates `rows` at lines 308–320.

4. For each row, `main.rs` lines 311–319 build `line` by `[..].join(col_sep)` and then
write it with `write!(out, "{}{}", line, row_sep)` at line 320; when `row.text` contains
`col_sep` or `row_sep`, the output row gains extra separators or embedded newlines, so
downstream CSV/TSV consumers (e.g. Python's `csv.reader` or spreadsheet tools) see shifted
columns or split rows instead of one logical record—unlike the Python CLI, which uses
`csv.writer` with proper quoting in `py-usfm-parser/src/usfm_grammar/__main__.py:18–30`.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** rust-usfm-parser/src/main.rs
**Line:** 310:320
**Comment:**
	*Logic Error: Table output is generated by raw string joining, so any field containing the chosen separator or row separator will corrupt the table structure (shifted columns or broken rows). Use a proper CSV/TSV writer with escaping/quoting instead of manual `join`.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

Comment on lines +379 to +381
let v_end = append_empty(parent, "verse");
v_end.set_attr("eid", eid.as_str());
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Verse end milestones are closed on the current parent instead of the paragraph/row that actually owned the previous verse content. When a verse boundary crosses into a new paragraph or table row, this will attach the eid to the wrong node and produce incorrect USX verse structure. Track and use the previous verse container (like the Python prev_verse_parent) when emitting the previous eid. [logic error]

Severity Level: Critical 🚨
- ❌ USFM-to-USX conversion misplaces verse closing milestones.
- ⚠️ Round-trip conversions can desynchronize verse boundaries.
Steps of Reproduction ✅
1. Construct a `USXGenerator` with USFM input where a verse ends in one paragraph and the
next `\v` marker starts a new paragraph (e.g., `\p \v 1 ...` followed by `\p \v 2 ...` in
the same chapter) and obtain a tree-sitter AST root node.

2. Call `USXGenerator::get_usx` at `rust-usfm-parser/src/usx_generator.rs:142`, which
dispatches to `node_2_usx_chapter` for the chapter node at
`rust-usfm-parser/src/usx_generator.rs:293`.

3. Inside `node_2_usx_chapter` (lines 293–301), the first `paragraph` child is handled by
`node_2_usx` with `parent` pointing to the USX root; `node_2_usx_para` (lines 460–514)
creates the first `<para>` and, when it encounters `v 1`, calls `node_2_usx_verse` (lines
376–433) with `parent` pointing to that first `<para>`, setting `prev_verse_sid_to_close =
Some("BOOK 1:1")`.

4. When the second `paragraph` (containing `v 2`) is processed, `node_2_usx_para` creates
a second `<para>`; its `v 2` node again calls `node_2_usx_verse`, but at entry the
`parent` now points to the second `<para>`. The code at lines 379–381 appends `<verse
eid="BOOK 1:1"/>` to the second `<para>`, so the end milestone for verse 1 is attached to
the wrong container instead of the paragraph or row that actually contained verse 1's
content.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** rust-usfm-parser/src/usx_generator.rs
**Line:** 379:381
**Comment:**
	*Logic Error: Verse end milestones are closed on the current `parent` instead of the paragraph/row that actually owned the previous verse content. When a verse boundary crosses into a new paragraph or table row, this will attach the `eid` to the wrong node and produce incorrect USX verse structure. Track and use the previous verse container (like the Python `prev_verse_parent`) when emitting the previous `eid`.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

Comment on lines +213 to +214
// #[test]
fn test_compare_usx_with_testsuite_samples() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The main test that compares generated USX with testsuite reference XML is not annotated as a test, so it never runs. This silently removes regression coverage for exact USX output differences. Re-enable the test annotation so comparison failures are caught in CI. [code quality]

Severity Level: Critical 🚨
-`cargo test` omits USX-vs-origin.xml regression coverage.
- ⚠️ Structural USX changes may ship without detection.
- ⚠️ Differences in normalized whitespace may go unnoticed.
- ⚠️ Bible testsuite origin.xml alignment not enforced by Rust tests.
Steps of Reproduction ✅
1. Open `rust-usfm-parser/tests/test_usx_conversion.rs` and locate the section `//
test_compare_usx_with_testsuite_samples` starting at line 146.

2. Observe at lines 213–214 that the test attribute is commented out: `// #[test]`
followed by `fn test_compare_usx_with_testsuite_samples() {`, meaning this function is a
plain helper and not a Rust test.

3. Search for usages with `rg "test_compare_usx_with_testsuite_samples" -n
/workspace/usfm-grammar` and confirm that the only occurrence is this definition; no other
test or module calls it, so it is never executed during `cargo test`.

4. Run the Rust test suite for this crate (e.g., `cd rust-usfm-parser && cargo test`) and
note that only `test_successful_usx_conversion` (lines 73–89),
`test_all_markers_are_in_usx_output` (lines 95–143), and `test_usx_round_tripping` (lines
294–339) execute; there is no output or failure related to exact USX-vs-origin.xml
comparison even though `test_compare_usx_with_testsuite_samples` contains that logic at
lines 215–287, so regressions in exact USX serialization can pass unnoticed.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** rust-usfm-parser/tests/test_usx_conversion.rs
**Line:** 213:214
**Comment:**
	*Code Quality: The main test that compares generated USX with testsuite reference XML is not annotated as a test, so it never runs. This silently removes regression coverage for exact USX output differences. Re-enable the test annotation so comparison failures are caught in CI.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

Comment on lines +221 to +223
if !parser.errors.is_empty() {
continue;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: This test silently skips files that fail parsing instead of failing the test, which can hide real regressions in marker preservation. Treat parse errors as failures here so a valid fixture that starts failing does not get ignored. [logic error]

Severity Level: Major ⚠️
- ⚠️ Marker coverage test may pass while fixture parsing fails.
- ⚠️ Parser regressions may slip through CI unnoticed.
Steps of Reproduction ✅
1. Run `cargo test test_usj_all_markers_are_in_output` in `rust-usfm-parser/` so the test
at `rust-usfm-parser/tests/test_json_conversion.rs:214-246` executes.

2. Inside this test, the loop over `positive_files()` (constructed from `negative_tests()`
in `rust-usfm-parser/tests/common.rs:126-131`) initializes a parser for each `origin.usfm`
via `initialise_parser` at `rust-usfm-parser/tests/common.rs:176-181`.

3. For any supposedly "positive" fixture where `USFMParser::new` produces errors,
`parser.errors` is non-empty and the guard at `test_json_conversion.rs:221-223` (`if
!parser.errors.is_empty() { continue; }`) skips that file without pushing anything into
the local `failures` vector.

4. The final assertion at `test_json_conversion.rs:245` only checks `failures.is_empty()`,
so the test still passes even though at least one "valid" USFM fixture failed to parse and
its marker preservation was never validated (unlike the Python original at
`py-usfm-parser/tests/test_json_conversion.py:12-21`, which asserts on
`test_parser.errors`).

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** rust-usfm-parser/tests/test_json_conversion.rs
**Line:** 221:223
**Comment:**
	*Logic Error: This test silently skips files that fail parsing instead of failing the test, which can hide real regressions in marker preservation. Treat parse errors as failures here so a valid fixture that starts failing does not get ignored.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

Comment on lines +260 to +262
if !parser1.errors.is_empty() {
continue;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: In the round-trip test, parse errors in the source file are ignored with continue, so the test can falsely pass without exercising USJ→USFM round-tripping for affected fixtures. Record these as failures instead of skipping. [logic error]

Severity Level: Critical 🚨
- ⚠️ Round-trip conversion regressions may not be detected.
- ⚠️ USJ→USFM pipeline untested for some valid fixtures.
Steps of Reproduction ✅
1. Run `cargo test test_usj_round_tripping` so the round-trip test at
`rust-usfm-parser/tests/test_json_conversion.rs:255-291` executes.

2. For each file in `positive_files()` (from `tests/common.rs:126-131`), the test calls
`initialise_parser(&path)` at `tests/common.rs:176-181`, creating `parser1` with any parse
diagnostics stored in `parser1.errors`.

3. If a "positive" USFM fixture now produces parse errors (e.g. due to a regression in
`USFMParser::new`), the guard at `test_json_conversion.rs:260-262` (`if
!parser1.errors.is_empty() { continue; }`) skips that fixture entirely, so neither
`parser1.to_usj` (lines 264-269) nor `USFMParser::from_usj` and the re-parse at lines
272-282 are exercised.

4. Because no entry is added to the local `failures` vector for such skipped fixtures, the
final `assert!(failures.is_empty())` at `test_json_conversion.rs:291` still passes,
meaning USFM→USJ→USFM round-tripping is not validated for the failing files (in contrast
to the Python test at `py-usfm-parser/tests/test_json_conversion.py:48-57`, which asserts
on `test_parser1.errors`).

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** rust-usfm-parser/tests/test_json_conversion.rs
**Line:** 260:262
**Comment:**
	*Logic Error: In the round-trip test, parse errors in the source file are ignored with `continue`, so the test can falsely pass without exercising USJ→USFM round-tripping for affected fixtures. Record these as failures instead of skipping.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

Comment on lines +342 to +344
if !parser.errors.is_empty() {
continue;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: This comparison test also skips files when parsing fails, which hides regressions and makes results unreliable. The test should fail for unexpected parser errors on positive fixtures. [logic error]

Severity Level: Major ⚠️
- ⚠️ USJ vs origin.json mismatches may go untested.
- ⚠️ Parser failures can hide broken JSON comparison logic.
Steps of Reproduction ✅
1. Run `cargo test test_compare_usj_with_testsuite_samples` so the comparison test at
`rust-usfm-parser/tests/test_json_conversion.rs:333-380` executes.

2. For each path returned by `positive_files()` (from `tests/common.rs:126-131`), the test
constructs `parser` via `initialise_parser(&path)` at `tests/common.rs:176-181`,
populating `parser.errors` if the USFM fails to parse.

3. When any "positive" fixture yields non-empty `parser.errors`, the guard at
`test_json_conversion.rs:342-344` (`if !parser.errors.is_empty() { continue; }`) skips
that fixture, so no USJ is generated and no comparison to `origin.json` is performed for
that file.

4. Since skipped files do not add entries to the `failures` vector, the final assertion at
`test_json_conversion.rs:379` (`assert!(failures.is_empty(), ...)`) can still pass,
meaning the test suite may report success while some "valid" fixtures both fail to parse
and never have their USJ compared to the reference JSON (whereas the Python test at
`py-usfm-parser/tests/test_json_conversion.py:198-215` asserts on `test_parser.errors`
before comparison).

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** rust-usfm-parser/tests/test_json_conversion.rs
**Line:** 342:344
**Comment:**
	*Logic Error: This comparison test also skips files when parsing fails, which hides regressions and makes results unreliable. The test should fail for unexpected parser errors on positive fixtures.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

Comment on lines +353 to +359
let mut origin_usj: Value = match fs::read_to_string(&usj_path) {
Ok(text) => match serde_json::from_str(&text) {
Ok(v) => v,
Err(_) => continue,
},
Err(_) => continue, // no reference file — skip
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Errors while reading or parsing origin.json are swallowed and treated as skips, so missing/corrupt reference fixtures are never reported and the test can pass without performing the intended comparison. Only skip missing files if that is intended; malformed JSON should fail the test. [logic error]

Severity Level: Major ⚠️
- ⚠️ Corrupted origin.json fixtures are silently ignored in tests.
- ⚠️ USJ comparison coverage weaker than Python test suite.
Steps of Reproduction ✅
1. Run `cargo test test_compare_usj_with_testsuite_samples` to execute the test at
`rust-usfm-parser/tests/test_json_conversion.rs:333-380`.

2. For each "positive" fixture path, the test computes `usj_path =
path.with_file_name("origin.json")` at line 352 and then executes the nested `match` at
lines 353-359, calling `fs::read_to_string(&usj_path)` and `serde_json::from_str(&text)`.

3. If the `origin.json` file exists but is malformed JSON (e.g. truncated or hand-edited
incorrectly), `serde_json::from_str` returns `Err`, and the inner arm at
`test_json_conversion.rs:354-357` executes `Err(_) => continue`, causing the test to
silently skip that fixture instead of recording a failure in the `failures` vector.

4. Because missing files are also skipped via the outer `Err(_) => continue` at line 358
(matching the Python test's `FileNotFoundError` handling at
`py-usfm-parser/tests/test_json_conversion.py:208-217`), the final
`assert!(failures.is_empty())` at `test_json_conversion.rs:379` can still succeed even
when a reference `origin.json` is present but invalid and never actually compared.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** rust-usfm-parser/tests/test_json_conversion.rs
**Line:** 353:359
**Comment:**
	*Logic Error: Errors while reading or parsing `origin.json` are swallowed and treated as skips, so missing/corrupt reference fixtures are never reported and the test can pass without performing the intended comparison. Only skip missing files if that is intended; malformed JSON should fail the test.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

Comment on lines +187 to +192
let val_str = val.as_str().unwrap_or("");
if key == "file" {
self.usfm_string.push_str(&format!("src=\"{}\" ", val_str));
} else {
self.usfm_string.push_str(&format!("{}=\"{}\" ", key, val_str));
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Attribute values are written into quoted USFM attributes without escaping embedded quotes, so a value containing " will break attribute syntax and can corrupt the regenerated USFM. Sanitize or escape quotes in USJ attribute values before interpolation, consistently with the USX conversion path. [possible bug]

Severity Level: Major ⚠️
- ❌ CLI USJ→USX fails when attributes contain double quotes.
- ⚠️ USJ→USFM round-trip produces syntactically malformed USFM.
- ⚠️ Extra attributes after quote lost or misparsed on parse.
Steps of Reproduction ✅
1. Prepare a USJ JSON file for the CLI input, using the format consumed in
`rust-usfm-parser/src/main.rs:96-18` when `is_json` is true (extension `.json` or `.usj`).
Include at least one object with an extra attribute whose value contains a double quote,
e.g.:

   ```json

   {

     "type": "char",

     "marker": "x",

     "content": [],

     "custom": "foo\"bar"

   }
  1. Run the rust-usfm-parser CLI (binary with main in
    rust-usfm-parser/src/main.rs:35-53) with this file as infile, --in_format=usj (or
    .json extension), --out_format=usx and without --ignore_errors, so is_json is
    true at main.rs:14-18 and USFMParser::from_usj(&usj) is called at main.rs:15-17.

  2. Inside USFMParser::from_usj (rust-usfm-parser/src/usfm_parser.rs:10-17), the code
    constructs a USFMGenerator and calls usj_to_usfm(usj, false)
    (usfm_generator.rs:100-104). For the USJ object with "custom": "foo"bar", the "custom" key is not in NON_ATTRIB_USJ_KEYS (usfm_generator.rs:15-18), so the "Extra attributes" block at usfm_generator.rs:178-195runs:val_strbecomes"foo"bar" (line 187), and self.usfm_string.push_str(&format!("custom="{}" ", val_str));atline 191-192producescustom="foo"bar "`. This string has an embedded double quote that prematurely
    closes the attribute value, making the generated USFM syntactically malformed.

  3. Still inside from_usj, the malformed USFM is passed to USFMParser::build
    (usfm_parser.rs:45-66), which parses it with tree-sitter and collects ERROR nodes into
    self.errors (usfm_parser.rs:67-79). When the CLI later calls
    parser.to_usx(ignore_errors) at rust-usfm-parser/src/main.rs:19-21 with
    ignore_errors false by default (main.rs:201-205,225), guard_errors at
    usfm_parser.rs:42-48 sees non-empty self.errors, returns an error, and the CLI prints
    the error and exits with code 1 at main.rs:21-29. Thus, any USJ input containing an
    extra attribute value with a double quote reliably breaks USJ→USFM→USX conversion.

</details>

[Fix in Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt=This%20is%20a%20comment%20left%20during%20a%20code%20review.%0A%0A%2A%2APath%3A%2A%2A%20rust-usfm-parser%2Fsrc%2Fusfm_generator.rs%0A%2A%2ALine%3A%2A%2A%20187%3A192%0A%2A%2AComment%3A%2A%2A%0A%09%2APossible%20Bug%3A%20Attribute%20values%20are%20written%20into%20quoted%20USFM%20attributes%20without%20escaping%20embedded%20quotes%2C%20so%20a%20value%20containing%20%60%22%60%20will%20break%20attribute%20syntax%20and%20can%20corrupt%20the%20regenerated%20USFM.%20Sanitize%20or%20escape%20quotes%20in%20USJ%20attribute%20values%20before%20interpolation%2C%20consistently%20with%20the%20USX%20conversion%20path.%0A%0AValidate%20the%20correctness%20of%20the%20flagged%20issue.%20If%20correct%2C%20How%20can%20I%20resolve%20this%3F%20If%20you%20propose%20a%20fix%2C%20implement%20it%20and%20please%20make%20it%20concise.%0AOnce%20fix%20is%20implemented%2C%20also%20check%20other%20comments%20on%20the%20same%20PR%2C%20and%20ask%20user%20if%20the%20user%20wants%20to%20fix%20the%20rest%20of%20the%20comments%20as%20well.%20if%20said%20yes%2C%20then%20fetch%20all%20the%20comments%20validate%20the%20correctness%20and%20implement%20a%20minimal%20fix%0A) | [Fix in VSCode Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt=This%20is%20a%20comment%20left%20during%20a%20code%20review.%0A%0A%2A%2APath%3A%2A%2A%20rust-usfm-parser%2Fsrc%2Fusfm_generator.rs%0A%2A%2ALine%3A%2A%2A%20187%3A192%0A%2A%2AComment%3A%2A%2A%0A%09%2APossible%20Bug%3A%20Attribute%20values%20are%20written%20into%20quoted%20USFM%20attributes%20without%20escaping%20embedded%20quotes%2C%20so%20a%20value%20containing%20%60%22%60%20will%20break%20attribute%20syntax%20and%20can%20corrupt%20the%20regenerated%20USFM.%20Sanitize%20or%20escape%20quotes%20in%20USJ%20attribute%20values%20before%20interpolation%2C%20consistently%20with%20the%20USX%20conversion%20path.%0A%0AValidate%20the%20correctness%20of%20the%20flagged%20issue.%20If%20correct%2C%20How%20can%20I%20resolve%20this%3F%20If%20you%20propose%20a%20fix%2C%20implement%20it%20and%20please%20make%20it%20concise.%0AOnce%20fix%20is%20implemented%2C%20also%20check%20other%20comments%20on%20the%20same%20PR%2C%20and%20ask%20user%20if%20the%20user%20wants%20to%20fix%20the%20rest%20of%20the%20comments%20as%20well.%20if%20said%20yes%2C%20then%20fetch%20all%20the%20comments%20validate%20the%20correctness%20and%20implement%20a%20minimal%20fix%0A)

*(Use Cmd/Ctrl + Click for best experience)*
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>

```mdx
This is a comment left during a code review.

**Path:** rust-usfm-parser/src/usfm_generator.rs
**Line:** 187:192
**Comment:**
	*Possible Bug: Attribute values are written into quoted USFM attributes without escaping embedded quotes, so a value containing `"` will break attribute syntax and can corrupt the regenerated USFM. Sanitize or escape quotes in USJ attribute values before interpolation, consistently with the USX conversion path.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

Comment on lines +308 to +319
let sid = format!(
"{} {}:{}",
self.parse_state.book_slug.as_deref().unwrap_or(""),
self.parse_state.current_chapter.as_deref().unwrap_or(""),
num
);

let mut v_obj = json!({
"type": "verse",
"marker": "v",
"number": num,
"sid": sid,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The verse reference id is always constructed and emitted even when book or chapter state is missing, which produces malformed references like empty-book/empty-chapter SIDs on partial or error-tolerant parses. Only set a SID when both required state values are available, otherwise omit it. [logic error]

Severity Level: Major ⚠️
- ⚠️ USJ from to_usj() can contain malformed verse sid fields.
- ⚠️ BibleNLP export via to_biblenlp_format() inherits invalid SIDs.
- ⚠️ Downstream tools consuming sid cannot rely on verse references.
Steps of Reproduction ✅
1. Prepare malformed or partial USFM that includes a `\v` verse marker without a preceding
`\id` book marker or `\c` chapter marker, so the tree-sitter parse cannot populate
book/chapter state (grammar for `id` and `chapter` is in
`tree-sitter-usfm3/grammar.js:45-48` and `146-154`).

2. In Rust, call `USFMParser::to_usj(exclude_markers, include_markers, ignore_errors =
true, combine_texts)` from `rust-usfm-parser/src/usfm_parser.rs:26-48` so that
`guard_errors()` allows generation despite parse errors caused by the missing markers.

3. `to_usj()` builds a `USJGenerator` and calls `get_usj(self.syntax_tree.root_node())`;
within `get_usj`, `node_2_usj()` (`rust-usfm-parser/src/usj_generator.rs:382-447`)
dispatches each `v` node to `node_2_usj_verse()` at
`rust-usfm-parser/src/usj_generator.rs:270-330`.

4. In `node_2_usj_verse()`, the `sid` is constructed at `usj_generator.rs:308-313` using
`self.parse_state.book_slug.as_deref().unwrap_or("")` and
`self.parse_state.current_chapter.as_deref().unwrap_or("")`, both still `None` because no
`id` or `c` node was processed, so the resulting USJ verse objects contain malformed
identifiers like `"sid": " :5"` with empty book and chapter components.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** rust-usfm-parser/src/usj_generator.rs
**Line:** 308:319
**Comment:**
	*Logic Error: The verse reference id is always constructed and emitted even when book or chapter state is missing, which produces malformed references like empty-book/empty-chapter SIDs on partial or error-tolerant parses. Only set a SID when both required state values are available, otherwise omit it.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

Comment on lines +738 to +746
if nestable {
self.node_2_usj(child, &mut para_obj);
} else {
if !node_added {
Self::push_obj(parent, para_obj.clone());
node_added = true;
}
self.node_2_usj(child, parent);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The paragraph object is pushed as a clone the first time a non-nestable child is seen, and later nestable children are still appended to the local para_obj that is no longer connected to the output tree. This can silently drop trailing inline content from the generated USJ. Push the paragraph object once up front (or mutate the already-pushed node) so all nestable children are preserved. [logic error]

Severity Level: Critical 🚨
- ❌ USJ output from to_usj() drops text after figures.
- ❌ List output from to_list() omits trailing heading content.
- ⚠️ BibleNLP export via to_biblenlp_format() may miss title text.
Steps of Reproduction ✅
1. Create a USFM file that uses a generic paragraph-style marker such as `\cd` with inline
text, an embedded figure, and trailing text, for example: `\cd First part \fig ...\fig*
second part`. The `cd` rule allowing `fig` is defined in
`tree-sitter-usfm3/grammar.js:180-186`.

2. In Rust, construct a parser and call `USFMParser::to_usj(..)` from
`rust-usfm-parser/src/usfm_parser.rs:26-48` on this USFM, so it builds a syntax tree and
then invokes `USJGenerator::get_usj(self.syntax_tree.root_node())`.

3. Inside `USJGenerator::node_2_usj()` at `rust-usfm-parser/src/usj_generator.rs:382-447`,
the `cd` node is dispatched to `node_2_usj_generic()`
(`rust-usfm-parser/src/usj_generator.rs:306-352`), which creates a local `para_obj` and
then iterates its children.

4. In `node_2_usj_generic()`, the first `text` child is treated as `nestable` and appended
to `para_obj`, the `fig` child is treated as non-nestable and causes `para_obj.clone()` to
be pushed to `parent` at `usj_generator.rs:738-743`, and any subsequent `text` children
(the "second part") are again treated as `nestable` and appended only to the local
`para_obj` that is no longer pushed. Inspecting the `to_usj()` JSON output shows that text
after the `\fig` (e.g. "second part") is missing from the generated USJ.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** rust-usfm-parser/src/usj_generator.rs
**Line:** 738:746
**Comment:**
	*Logic Error: The paragraph object is pushed as a clone the first time a non-nestable child is seen, and later nestable children are still appended to the local `para_obj` that is no longer connected to the output tree. This can silently drop trailing inline content from the generated USJ. Push the paragraph object once up front (or mutate the already-pushed node) so all nestable children are preserved.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

@codeant-ai
Copy link
Copy Markdown

codeant-ai Bot commented Apr 28, 2026

CodeAnt AI finished running the review.


Thanks for using CodeAnt! 🎉

We're free for open-source projects. if you're enjoying it, help us grow by sharing.

Share on X ·
Reddit ·
LinkedIn

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Epic size:XXL This PR changes 1000+ lines, ignoring generated files

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants