The LatticeLabs toolkit is an ambitious monorepo comprising 6 packages for CAD document processing, neural STEP file understanding, geometric tokenization, optical CAD recognition, generative CAD, and point cloud processing. The codebase demonstrates strong architectural patterns (lazy imports, pipeline decomposition, vectorized numerics) but has critical security issues in code execution sandboxes, several correctness bugs in training loops and data pipelines, and significant performance bottlenecks in Python-loop-heavy geometry processing. The `ll_clouds` package is entirely empty, and `ll_ocadr` has zero pytest tests.
- `ProvenanceItem.timestamp` uses naive `datetime.now()` (`cadling/cadling/datamodel/base_models.py:197`): Timezone-ambiguous.
- Placeholder GitHub URL in `pyproject.toml` (`cadling/pyproject.toml:108-111`): `yourusername/cadling`.
- `export_to_markdown` adds `"..."` unconditionally (`cadling/cadling/datamodel/base_models.py:579`).
- `DocumentConverter` test file is untracked: Main entry point has no committed test coverage.
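The naive-timestamp finding can be illustrated with a short sketch. The `ProvenanceRecord` dataclass below is a hypothetical stand-in, not the actual `ProvenanceItem` model; it only contrasts `datetime.now()` with `datetime.now(timezone.utc)`.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Return the current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


@dataclass
class ProvenanceRecord:
    # Hypothetical stand-in for ProvenanceItem; only the timestamp matters here.
    source: str
    timestamp: datetime = field(default_factory=utc_now)


naive = datetime.now()             # tzinfo is None: ambiguous across machines
record = ProvenanceRecord(source="part.step")
aware = record.timestamp           # tzinfo is UTC: unambiguous

# isoformat() on the aware value carries an explicit UTC offset.
serialized = aware.isoformat()
```

Using a `default_factory` keeps the timestamp evaluated per instance rather than once at class definition time.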
- `mask_tokens` docstring swaps parameter semantics (`ll_stepnet/stepnet/pretrain.py:423-445`): `random_prob` described as "replace with random" but means "keep original".
- Positional encoding as `nn.Parameter` without documentation (`ll_stepnet/stepnet/encoder.py:58-60`): If fixed sinusoidal, should be `register_buffer`.
- No tests for pretrain models or coedge builder: Core self-supervised objectives and topology reconstruction have zero test coverage.
- `ll_ocadr` has zero pytest tests: `test_ll_ocadr.py` is a manual CLI script, not a test suite.
- `step_process.py` uses `print()` instead of logging (`ll_ocadr/vllm/process/step_process.py:73,89,94`).
- `_resolve_device` misleading log message (`ll_gen/ll_gen/generators/base.py:184-186`): Logs "CUDA not available; falling back to CPU" when CPU was explicitly requested.
- f-strings in `_log.debug` calls: Eagerly evaluated even when log level filters output. Multiple files.
- `_build_spatial_hash` dead code (`geotoken/geotoken/quantization/adaptive.py:310-329`): Never called, replaced by `_build_collision_groups`.
- `PointPatchEmbedding.patch_size` constructor param ignored (`ll_ocadr/vllm/lattice_encoder/shape_net.py:18,63`): `forward` hardcodes `num_patches = 256`.
- `repairable_reward` defaults to 0.0 (`ll_gen/ll_gen/config.py:209`): Field and code path are wasteful noise.
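The f-string logging item can be demonstrated with the standard `logging` module alone. The sketch below uses a hypothetical `Expensive` class to count how often its string form is computed; the logger name is also an assumption for illustration.

```python
import logging

_log = logging.getLogger("lazy_logging_demo")
_log.setLevel(logging.INFO)  # DEBUG records are filtered out


class Expensive:
    """Counts how often its string form is computed."""

    def __init__(self) -> None:
        self.format_calls = 0

    def __str__(self) -> str:
        self.format_calls += 1
        return "expensive-repr"


eager = Expensive()
lazy = Expensive()

# f-string: the argument is formatted before debug() is even called.
_log.debug(f"mesh stats: {eager}")

# %-style arguments: formatting is deferred until the record is emitted,
# which never happens here because DEBUG is below the logger's level.
_log.debug("mesh stats: %s", lazy)
```

After both calls, `eager.format_calls` is 1 and `lazy.format_calls` is 0, even though neither message is written anywhere.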
- Well-structured three-stage pipeline (`cadling/pipeline/base_pipeline.py`): Build -> Assemble -> Enrich mirrors proven docling architecture with per-stage timing and graceful failure handling.
- Subprocess-based execution isolation (`ll_gen/disposal/code_executor.py`): Running user CAD code in `subprocess.run()` with file-based IPC and defense-in-depth restricted builtins sandbox.
- Consistent lazy import strategy: All packages uniformly guard heavy optional deps (pythonocc, trimesh, torch, mlx) via try/except with boolean flags and graceful degradation.
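As a rough illustration of the lazy import pattern described above (try/except plus a boolean flag), the following sketch guards a deliberately nonexistent module. The module name `hypothetical_heavy_cad_backend` and the `bounding_box` helper are assumptions for this example, not names from the codebase.

```python
try:
    # Hypothetical heavy dependency; stands in for pythonocc, trimesh, etc.
    import hypothetical_heavy_cad_backend as cad_backend
    HAS_CAD_BACKEND = True
except ImportError:
    cad_backend = None
    HAS_CAD_BACKEND = False


def bounding_box(points):
    """Use the optional backend when present, else a pure-Python fallback."""
    if HAS_CAD_BACKEND:
        return cad_backend.bounding_box(points)
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


box = bounding_box([(0.0, 1.0, 2.0), (3.0, -1.0, 0.5)])
```

The module still imports cleanly when the dependency is missing, and callers branch on the flag instead of catching `ImportError` at every call site.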
- Sparse adjacency matrices throughout GCN pipeline (`ll_stepnet`): Correct use of sparse COO tensors prevents O(N^2) memory. Symmetric GCN normalization computed on sparse tensors without densifying.
- `weights_only=True` on all `torch.load` calls (`ll_stepnet`): Prevents arbitrary code execution from malicious checkpoints.
- Vectorized curvature computation (`geotoken/curvature.py`): Fully vectorized cotangent-weight Laplace-Beltrami using `np.add.at` scatter ops with degenerate-face masking.
- `_build_collision_groups` O(n log n) approach (`geotoken/adaptive.py`): Structured-array lexicographic sort replaces O(n^2) spatial hash.
- Streaming STEP file parsing (`ll_ocadr/file_content_chunker.py`): Line-by-line reading safe for multi-GB files.
- Clean global vs. local geometry encoding (`ll_ocadr/latticelabs_ocadr.py`): Two-path design (ShapeNet + GeometryNet with chunking) mirrors OCR image tiling.
- Reproducible random sampling (`cadling/sdg/qa/generate.py`): Dedicated `random.Random(seed)` instance for concurrent SDG pipelines.
- Comprehensive graph encoder tests (`ll_stepnet/test_graph_encoder.py`): 10 test classes covering sparse/dense equivalence, gradient flow, device consistency, edge cases.
- `LazyTopologyLoader` correct double-checked locking (`ll_stepnet/streaming_processor.py`): Avoids blocking, prevents duplicates.
- REINFORCE implementation (`ll_gen/generators/neural_vae.py`): Correctly samples from live computation graph and accumulates log-probs on exact trajectory.
- `CADVocabulary` deterministic token encoding (`geotoken/vocabulary.py`): Partitioned ID space with explicit offset arithmetic and correct save/load round-trip.
- Binary STL detection via file-size validation (`ll_ocadr/file_content_chunker.py`): Handles the known `solid` prefix pitfall with the standard size-based heuristic.
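The size-based STL heuristic mentioned in the last item relies on the fixed binary layout: an 80-byte header, a little-endian uint32 triangle count, and 50 bytes per triangle. The sketch below is a minimal standalone version under that layout, not the code in `file_content_chunker.py`; the helper names are hypothetical.

```python
import os
import struct
import tempfile

HEADER_BYTES = 80      # free-form header; may legally start with b"solid"
COUNT_BYTES = 4        # little-endian uint32 triangle count
TRIANGLE_BYTES = 50    # 12 float32 values + 2-byte attribute count


def is_binary_stl(path: str) -> bool:
    """Return True if the file size matches the binary STL layout."""
    size = os.path.getsize(path)
    if size < HEADER_BYTES + COUNT_BYTES:
        return False
    with open(path, "rb") as fh:
        fh.seek(HEADER_BYTES)
        (n_triangles,) = struct.unpack("<I", fh.read(COUNT_BYTES))
    return size == HEADER_BYTES + COUNT_BYTES + n_triangles * TRIANGLE_BYTES


def _write_temp(data: bytes) -> str:
    """Write bytes to a temporary .stl file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".stl")
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
    return path


# Binary file whose header misleadingly begins with "solid".
binary_data = (
    b"solid".ljust(HEADER_BYTES, b" ")
    + struct.pack("<I", 2)
    + bytes(2 * TRIANGLE_BYTES)
)
# Small ASCII file that also begins with "solid".
ascii_data = b"solid cube\n" + b"  facet normal 0 0 1\n" * 10 + b"endsolid cube\n"

binary_path = _write_temp(binary_data)
ascii_path = _write_temp(ascii_data)
binary_result = is_binary_stl(binary_path)
ascii_result = is_binary_stl(ascii_path)
os.remove(binary_path)
os.remove(ascii_path)
```

Checking only for a leading `solid` would misclassify the first file as ASCII; the size check classifies both correctly here.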
- Security (P0): Audit and harden all `exec()` / subprocess code paths in `cadling/generation/` and `ll_gen/disposal/`. Restrict `__builtins__` in the in-process fallback. Pass file paths via env vars, not source interpolation. Validate input paths in `DocumentConverter`.
- Correctness (P0): Fix `total_loss > 0` tensor comparison crash in `pretrain.py`. Add missing `_log` definition in `encoder.py`. Fix `CadQueryProposer` key mismatch. Fix temp file lifetime in `code_executor.py`.
- Testing (P1): Add pytest suites for `ll_ocadr` (zero tests) and `ll_stepnet/pretrain.py` (core objectives untested), and commit the untracked `test_document_converter.py`. Add coedge builder unit tests.
- Performance (P1): Vectorize vertex normal computation in `step_process.py`. Batch encoder calls in `latticelabs_ocadr.py`. Cap collision resolution radius in `adaptive.py`. Use FPS from `torch_cluster`.
- Dependencies (P2): Add `scipy` to geotoken's declared deps. Tighten `vllm>=0.2.0` to a compatible range. Remove conditional numpy imports where numpy is a core dep.
- Architecture (P2): Share graph encoder between causal/masked LM heads in `STEPForHybridLM`. Eliminate duplicate transformer stacks in `STEPTransformerDecoder`. Sync lazy projection layers in all trainers, not just `STEPTrainer`.
- Dead code (P2): Remove `_build_spatial_hash` in geotoken, `_timeout_handler`/SIGALRM in ll_gen, and the duplicate top-level `brep_backend.py` in cadling.
- ll_clouds (P3): Either implement the package or remove the empty scaffold.
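The security recommendation to pass file paths via environment variables rather than source interpolation could look roughly like the sketch below. The variable name `LL_OUTPUT_PATH`, the child script, and `run_isolated` are hypothetical and do not reflect the actual `code_executor.py` interface.

```python
import os
import subprocess
import sys
import tempfile

# Fixed child script: the path is read from the environment at run time,
# so no caller-supplied text is ever spliced into the source code.
CHILD_SOURCE = (
    "import os\n"
    "path = os.environ['LL_OUTPUT_PATH']\n"
    "with open(path, 'w') as fh:\n"
    "    fh.write('ok')\n"
)


def run_isolated(output_path: str, timeout: float = 30.0) -> subprocess.CompletedProcess:
    """Run the fixed script in a child interpreter, passing the path via env."""
    env = dict(os.environ, LL_OUTPUT_PATH=output_path)
    return subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )


# A path containing a quote would break (or hijack) naive interpolation
# such as f"open('{path}')"; as an environment variable it is just data.
workdir = tempfile.mkdtemp()
tricky_path = os.path.join(workdir, "out'); print('injected.txt")
result = run_isolated(tricky_path)
with open(tricky_path) as fh:
    written = fh.read()
```

This only addresses the interpolation risk; it does not by itself sandbox what the child process is allowed to do.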