Crash in vg pack -e -d due to invalid UTF-8 in vg.Edit.sequence (real GAM, Nanopore data) #4668

@florenmartino

First of all, thank you for your incredible tool. I'm relatively new to using vg, and while it's powerful, it's also quite complex. I'm not sure if what I'm encountering is a real bug or a problem specific to my workflow, but it's been extremely frustrating — I've tried everything I could think of without success.

1. What were you trying to do?

I'm trying to run a reference-based mapping pipeline using vg map and vg pack on a pangenome graph (GFA → VG) of circular single-stranded viruses of approximately 3-4 kb. The pipeline maps long Nanopore reads to the graph and then uses vg pack to compute per-node coverage and edit information to summarize support for each path (reference genome).

2. What did you want to happen?

I expected vg pack -e -d to generate a coverage.tsv file containing both coverage and edit columns, so I can use the downstream R script that requires this information to rank which reference path was best supported by each sample.

3. What actually happened?

vg pack fails when the -e option is provided.

The command:

vg pack -x graph.xg -g sample.sorted.gam -o sample.pack
vg pack -x graph.xg -i sample.pack -e -d > sample_coverage.tsv

Produces the following error:

Annotation Filter:             0
Incorrectly Mapped Filter:     0
Max Reads Filter:              0

 break into sorted chunks       [========================================================================================]100.0%
 merge 6 files                  [========================================================================================]100.0%
[Packing]
[libprotobuf ERROR google/protobuf/wire_format_lite.cc:577] String field 'vg.Edit.sequence' contains invalid UTF-8 data when parsing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes.
[libprotobuf ERROR google/protobuf/wire_format_lite.cc:577] String field 'vg.Edit.sequence' contains invalid UTF-8 data when parsing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes.
[libprotobuf ERROR google/protobuf/wire_format_lite.cc:577] String field 'vg.Edit.sequence' contains invalid UTF-8 data when parsing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes.
terminate called after throwing an instance of 'j2pb_error'
  what():  sequence: Fail to convert to json
━━━━━━━━━━━━━━━━━━━━
Crash report for vg v1.65.0 "Carfon"
Caught signal 6 raised at address 0x22dc7ec; tracing with backward-cpp
Stack trace (most recent call last):
#15   Object "/home/fmartino/miniconda3/envs/anellome/bin/vg", at 0x6635d4, in _start
#14   Object "/home/fmartino/miniconda3/envs/anellome/bin/vg", at 0x2298ab6, in __libc_start_main
#13   Object "/home/fmartino/miniconda3/envs/anellome/bin/vg", at 0x2297219, in __libc_start_call_main
#12   Object "/home/fmartino/miniconda3/envs/anellome/bin/vg", at 0xf41e0b, in vg::subcommand::Subcommand::operator()(int, char**) const
#11   Object "/home/fmartino/miniconda3/envs/anellome/bin/vg", at 0xef9ff2, in main_pack(int, char**)
#10   Object "/home/fmartino/miniconda3/envs/anellome/bin/vg", at 0x14129ac, in vg::Packer::as_table(std::ostream&, bool, std::vector<long long, std::allocator<long long> >)
#9    Object "/home/fmartino/miniconda3/envs/anellome/bin/vg", at 0x1e8c200, in pb2json[abi:cxx11](google::protobuf::Message const&)
#8    Object "/home/fmartino/miniconda3/envs/anellome/bin/vg", at 0x1e8c0c5, in _pb2json(google::protobuf::Message const&)
#7    Object "/home/fmartino/miniconda3/envs/anellome/bin/vg", at 0x5ffd79, in _field2json(google::protobuf::Message const&, google::protobuf::FieldDescriptor const*, unsigned long) [clone .cold]
#6    Object "/home/fmartino/miniconda3/envs/anellome/bin/vg", at 0x21d1038, in __cxa_throw
#5    Object "/home/fmartino/miniconda3/envs/anellome/bin/vg", at 0x21d0ed6, in std::terminate()
#4    Object "/home/fmartino/miniconda3/envs/anellome/bin/vg", at 0x21d0e6b, in __cxxabiv1::__terminate(void (*)())
#3    Object "/home/fmartino/miniconda3/envs/anellome/bin/vg", at 0x61f42b, in __gnu_cxx::__verbose_terminate_handler() [clone .cold]
#2    Object "/home/fmartino/miniconda3/envs/anellome/bin/vg", at 0x621b73, in abort
#1    Object "/home/fmartino/miniconda3/envs/anellome/bin/vg", at 0x22afc95, in raise
#0    Object "/home/fmartino/miniconda3/envs/anellome/bin/vg", at 0x22dc7ec, in __pthread_kill

Library locations:
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
━━━━━━━━━━━━━━━━━━━━
Context dump:
        Thread 0: Starting 'pack' subcommand
Found 1 threads with context.
━━━━━━━━━━━━━━━━━━━━
Please include this entire error log in your bug report!
━━━━━━━━━━━━━━━━━━━━

  • I verified the GAM file with:
    vg view -a sample.gam | grep -P '[^\x00-\x7F]' → no non-ASCII characters detected in the JSON output.
  • I rebuilt the graph using vg prune to remove any problematic snarls.
  • I tested with and without vg filter -q, -r, -P, etc. to ensure only good alignments were passed to vg pack.
  • I confirmed that vg pack works fine without the -e flag (generates .pack and basic coverage table).
  • I tried vg pack -e -d directly from the GAM file (skipping .pack) and got the same error.
  • I tested using multiple GAMs from different samples: same issue only when -e is used.
  • I updated protobuf and rebuilt vg in a clean environment, just in case it was a protobuf compatibility issue.
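
One caveat on the grep check above: it looks for non-ASCII characters in the JSON that vg view -a prints, which is not the same thing as invalid UTF-8 in the stored protobuf field. A stricter check could look like the Python sketch below. It is only a sketch under assumptions: the helper names (is_valid_utf8, suspicious_edits, scan) are hypothetical, the field names path / mapping / edit / sequence are assumed to match the JSON that vg view -a emits for each alignment, and the expected alphabet is assumed to be plain nucleotides. It also only inspects the GAM, so it cannot tell whether the bad bytes are introduced later, inside the .pack file.

```python
import json

# Assumption: edit sequences should contain only plain nucleotide letters.
EXPECTED = set("ACGTNacgtn")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 (ASCII is a subset of UTF-8)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def suspicious_edits(alignment: dict):
    """Yield (mapping_index, edit_index, sequence) for every edit whose
    sequence contains characters outside EXPECTED.

    The keys used here are an assumption about the alignment JSON layout."""
    mappings = alignment.get("path", {}).get("mapping", [])
    for mi, mapping in enumerate(mappings):
        for ei, edit in enumerate(mapping.get("edit", [])):
            seq = edit.get("sequence", "")
            if set(seq) - EXPECTED:
                yield mi, ei, seq


def scan(lines):
    """Check an iterable of byte lines (one JSON alignment per line) and
    return a list of human-readable findings."""
    findings = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        if not is_valid_utf8(line):
            findings.append(f"line {line_no}: invalid UTF-8 bytes")
            continue
        record = json.loads(line)
        for mi, ei, seq in suspicious_edits(record):
            name = record.get("name", "?")
            findings.append(
                f"line {line_no}: read {name} mapping {mi} edit {ei}: {seq!r}"
            )
    return findings
```

To use it, the output of vg view -a sample.gam would be piped into a script that calls scan(sys.stdin.buffer) and prints each finding. An empty result would suggest the GAM edits themselves are clean and that the problem arises later, though that is a guess rather than a diagnosis.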

So far, only dropping -e prevents the crash, but then I can't compute edit distances required for downstream analysis.

This occurs only when -e is used. The .gam file is generated successfully and seems valid (I can run vg view, vg gamsort, vg pack without -e, and mapq summaries all work fine). Without -e, the pipeline completes, but I lose the edit column required for scoring.

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

N/A. No stacktrace file was generated.

5. What data and command can the vg dev team use to make the problem happen?

I’m mapping real Nanopore data against a circular viral GFA-based graph with pruned paths and xg/gcsa/gbwt built from it.

The relevant commands are:

vg convert -g ref.gfa > ref.raw.vg
vg prune ref.raw.vg > ref.vg
vg index -x ref.xg ref.vg
vg gbwt -x ref.xg -o ref.gbwt -P --pass-paths
vg index -g ref.gcsa ref.xg

vg map -f reads.fastq.gz -x ref.xg -g ref.gcsa -d ref -1 ref.gbwt -m long > sample.gam
vg gamsort sample.gam -p > sample.sorted.gam

# This works
vg pack -x ref.xg -g sample.sorted.gam -o sample.pack
vg pack -x ref.xg -i sample.pack -d > sample_coverage.tsv
# This fails
vg pack -x ref.xg -i sample.pack -e -d > sample_coverage.tsv

I tried rebuilding the graph with vg prune, confirmed that the GAM file is valid with vg view -a, and tested with and without vg filter. None of that fixed the crash with -e.

6. What does running vg version say?


vg version v1.65.0 "Carfon"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Using HTSlib headers 101990, library 1.19.1-29-g3cfe8769
Built by [email protected]


Additional Notes:

  • If I omit -e, the pipeline completes, but the final R scripts fail because the edits column is missing and I need it for my analysis.
  • I'm happy to share the GFA, XG, and GAM files if needed.

Let me know if I can help reproduce this further or test a patch.

Thanks!
