feat: per-variant constructors for data enums in the dynamic backends #148

Merged
Goldziher merged 7 commits into xberg-io:main from tobocop2:feat/data-enum-variant-constructors
Jun 25, 2026

@tobocop2 tobocop2 commented Jun 25, 2026

Contributor

Closes #147.

What

Generate a per-variant factory constructor for every data-carrying variant of an internally-tagged data enum, in the dynamic backends. The statically-typed backends already derive per-variant constructors from the variant structure; this brings the dynamic backends to the same model instead of leaving them on stringly-typed new(type=...) / raw maps.

Why

For the same enum, construction diverged purely by target. Take EmbeddingModelType (Preset { name }, Custom { model_id, dimensions }, Llm { llm }, Plugin { name }):

Static backends — already type-safe, the variant name is the constructor:

new EmbeddingModelType.Preset("balanced")          // C# / Java
.preset(name: "balanced")                          // Swift / Dart

Dynamic backends, before — stringly-typed, no per-variant constructor:

EmbeddingModelType(type="preset", name="balanced")   # pyo3 — magic "preset" string
{ type: 'preset', name: 'balanced' }                 # magnus — raw Hash
EmbeddingConfig::from_json(json_encode(["model" => ["type" => "preset", "name" => "balanced"]]))   # php — empty class, hand-built JSON
"{\"model\":{\"name\":\"balanced\",\"type\":\"preset\"}}"   # rustler — literal JSON string

Dynamic backends, after — per-variant constructors, parity with the static backends:

EmbeddingModelType.preset("balanced")                                          # pyo3
EmbeddingModelType.custom(model_id="BAAI/bge-small-en-v1.5", dimensions=384)   # pyo3
EmbeddingModelType.llm(cfg)                                                    # pyo3
EmbeddingModelType.plugin("my-backend")                                        # pyo3
EmbeddingModelType.preset("balanced")                                          # magnus
EmbeddingModelType::preset("balanced");                                        # php
EmbeddingModelType.preset("balanced")                                          # rustler

Type-safe (no magic "preset" string), discoverable via autocomplete, and complete — it covers Custom / Llm / Plugin, which the data-variant bare-string shorthand never addressed.
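To make the before/after contrast concrete, here is a minimal pure-Python sketch of the shape the per-variant constructors take. It is a stand-in only: the real class is generated by the pyo3 backend, and the `tag` / `fields` attributes and the `to_wire` helper are assumptions made for illustration. The `llm` variant is left out because the type of its `cfg` argument is not shown in this PR.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class EmbeddingModelType:
    """Hypothetical stand-in for the generated binding class (not the real one)."""

    tag: str
    fields: Dict[str, Any] = field(default_factory=dict)

    # One constructor per data-carrying variant: the method name carries the
    # discriminator, so callers never type the "preset" string themselves.
    @classmethod
    def preset(cls, name: str) -> "EmbeddingModelType":
        return cls("preset", {"name": name})

    @classmethod
    def custom(cls, model_id: str, dimensions: int) -> "EmbeddingModelType":
        return cls("custom", {"model_id": model_id, "dimensions": dimensions})

    @classmethod
    def plugin(cls, name: str) -> "EmbeddingModelType":
        return cls("plugin", {"name": name})

    def to_wire(self) -> Dict[str, Any]:
        # Internally tagged shape: the "type" tag sits next to the variant fields.
        return {"type": self.tag, **self.fields}


model = EmbeddingModelType.custom(model_id="BAAI/bge-small-en-v1.5", dimensions=384)
wire = model.to_wire()
```

The wire form is the same dictionary the stringly-typed `type="..."` call describes; only the way the caller spells the variant changes.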

string_shorthand was merged in #133 to address #135, but it isn't a good design and is reverted here. A bare string is stringly-typed, only ever worked for the single-field Preset case, and only on pyo3 and magnus — it doesn't extend to Custom / Llm / Plugin or to the other backends. This PR addresses the same goal properly: every variant gets a typed per-variant constructor, the same way across all backends. #135 is superseded.

The unit-variant string handling from #132 is preserved and unaffected — a fieldless variant name ("disabled") is a fine string, and per-variant constructors don't replace it.

How

  • Derive one constructor per data struct variant from the variant's fields; body constructs the core variant directly (Self { inner: <core>::<Variant> { field, .. } }), reusing the existing param / let-binding / conversion machinery.
  • Each constructor collides with the variant accessor of the same name, so it uses the _factory_<name> Rust ident + the host-facing <name> (pyo3 #[pyo3(name=...)], etc.), per backend.
  • A hand-written impl constructor of the same name suppresses synthesis (consumer wins). Unit, tuple, and binding_excluded variants are skipped.
  • Static backends already conform — audited, no change.
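The selection and naming rules in the list above can be modelled in a few lines. This is a Python sketch of the rules as described, not the generator's Rust code: `Variant`, `PlannedConstructor`, and `select_variant_constructors` are hypothetical names, and the `kind` strings are an assumed encoding of the variant shape.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Variant:
    name: str                       # snake_case variant name, e.g. "preset"
    kind: str                       # assumed encoding: "struct", "tuple", or "unit"
    binding_excluded: bool = False


@dataclass
class PlannedConstructor:
    rust_ident: str                 # internal Rust fn name, avoids the accessor clash
    host_name: str                  # name the host language sees


def select_variant_constructors(
    variants: Iterable[Variant], hand_written: Set[str]
) -> List[PlannedConstructor]:
    planned = []
    for v in variants:
        # Only data-carrying struct variants qualify.
        if v.kind != "struct" or v.binding_excluded:
            continue
        # A hand-written method of the same name wins over synthesis.
        if v.name in hand_written:
            continue
        planned.append(PlannedConstructor(f"_factory_{v.name}", v.name))
    return planned


variants = [
    Variant("preset", "struct"),
    Variant("custom", "struct"),
    Variant("disabled", "unit"),
    Variant("raw", "tuple"),
    Variant("internal", "struct", binding_excluded=True),
    Variant("plugin", "struct"),
]
planned = select_variant_constructors(variants, hand_written={"plugin"})
```

With this input only `preset` and `custom` are planned: the unit, tuple, and excluded variants are skipped, and `plugin` is suppressed by the hand-written method.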

Verification

Regenerated kreuzberg against this branch and checked every impacted binding:

  • All five backends emit the per-variant constructors on the real EmbeddingModelType / RerankerModelType (and other data enums) — EmbeddingModelType.preset(...) / .custom(...) / .llm(...) / .plugin(...) in Python, EmbeddingModelType.preset(...) in Ruby, ::preset(...) in PHP, EmbeddingModelType$preset(...) in R, EmbeddingModelType.preset(...) in Elixir.
  • cargo check passes on the regenerated kreuzberg-py, kreuzberg-php, and kreuzberg-node. This surfaced (and the fix commits resolve) several real-world field-type cases the neutral unit fixtures missed: variants with sanitized / binding-excluded fields are skipped (they can't be built from the binding), promoted-optional params unwrap to non-optional core fields, return-only DTOs get a generated From impl, and field conversions are inlined so no non-re-exported core type path is named.
  • The unit-variant bare-string handling from #132 ("pyo3/magnus: enum constructors reject bare variant strings for internally-tagged enums"), i.e. the {"<tag>": s} wrap, is preserved across pyo3 and magnus; string_shorthand is fully removed with no dangling references.
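As an illustration of the preserved unit-variant wrap, the sketch below shows the idea in plain Python: a bare string is lifted into a one-key mapping under the tag before deserialization. The function name `normalize_enum_input` and the default tag `"type"` are assumptions for this example; the generated pyo3 and magnus code does the equivalent in Rust.

```python
from typing import Any, Dict, Union


def normalize_enum_input(value: Union[str, Dict[str, Any]], tag: str = "type") -> Dict[str, Any]:
    """Hypothetical helper: accept a bare unit-variant name or a tagged mapping."""
    if isinstance(value, str):
        # Bare string such as "disabled" becomes {"<tag>": s}.
        return {tag: value}
    if isinstance(value, dict):
        # Already a tagged mapping: pass through as a copy.
        return dict(value)
    raise TypeError(f"expected str or dict, got {type(value).__name__}")


wrapped = normalize_enum_input("disabled")
```

Data-carrying variants are not reachable through this path, which is why they get per-variant constructors instead.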

The shared collect_variant_constructors (variant selection) and variant_field_init (field conversion) are the single source of truth for the wrapper-convert backends; magnus builds the binding enum directly; rustler emits pure-Elixir constructors. Whole-crate cargo test, clippy -D warnings, and fmt are clean.

tobocop2 added 7 commits June 25, 2026 00:00
Emit one `#[staticmethod]` constructor per data-carrying struct variant of an
internally-tagged data enum, so callers write `EmbeddingModelType.preset("balanced")`
instead of building the value through the stringly-typed
`EmbeddingModelType(type="preset", ...)` form. The discriminator is carried by the
variant name, which is type-safe and discoverable.

Each constructor builds the core variant directly
(`Self { inner: <core_path>::Preset { name } }`) and reuses the existing
param / let-binding / call-arg machinery (and the `pyo3_factory_method.jinja`
template) for field conversion. Constructors always collide with the variant
accessor of the same snake_case name, so they use the `_factory_<name>` Rust ident
plus `#[pyo3(name = "<name>")]`.

Skips unit variants, tuple variants, and binding_excluded variants. A hand-written
`impl` method of the same name suppresses the generated constructor (consumer wins).

The mapper arg to `gen_pyo3_data_enum_with_mapper` now drives this; the dead
associated-function factory projection (only ever called with `None` in production,
and explicitly unwanted) is removed. The pyo3 backend now passes the real
`Pyo3Mapper`, so the constructors emit in generated output. The existing
`#[new]` dict/kwargs/string constructor stays as-is; the variant constructors are
additive.

Build the `(field, converted_expr)` pairs for the per-variant constructor struct
literal directly from per-param expression vectors, instead of joining the exprs
with `gen_call_args` and re-splitting the comma-joined string. The re-split could
misalign field→expr if a converted expression ever carried a top-level comma or
tripped the `<`/`>` depth tracking.
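The misalignment risk is easy to reproduce in miniature. The sketch below is a Python model, not the deleted Rust helper: `split_top_level` is a hypothetical depth-tracking splitter that treats `<` and `>` as brackets, and the field names and expressions are made up for the example.

```python
from typing import List, Tuple


def split_top_level(joined: str) -> List[str]:
    """Hypothetical splitter: split on commas at depth zero, counting <> as brackets."""
    parts: List[str] = []
    current: List[str] = []
    depth = 0
    for ch in joined:
        if ch in "([{<":
            depth += 1
        elif ch in ")]}>":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def pair_fields(fields: List[str], exprs: List[str]) -> List[Tuple[str, str]]:
    # Pair directly from the per-param vectors: no join, no re-split.
    return list(zip(fields, exprs))


fields = ["flag", "width"]
exprs = ["a < b", "w as u32"]

# The comparison operator is mistaken for an opening bracket, so the comma
# separating the two expressions is never seen at depth zero.
resplit = split_top_level(", ".join(exprs))
misaligned = list(zip(fields, resplit))
aligned = pair_fields(fields, exprs)
```

Here the re-split yields a single chunk and `width` silently loses its expression, while pairing from the vectors keeps one expression per field.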

Add `gen_call_args_vec` and `gen_call_args_with_let_bindings_json_str_vec` returning
`Vec<String>`; the existing joined helpers delegate to them so there is one source
of truth. Delete the now-unused `split_top_level_args`.

…t-variant wrap

Drop the opt-in `#[alef(string_shorthand(variant, field))]` data-variant
bare-string shorthand. Per-variant constructors supersede it: they cover every
data variant and keep the discriminator type-safe instead of stringly-typed.

Removed: the `StringShorthand` IR type and `EnumDef::string_shorthand` field;
`extract_string_shorthand`; `resolve_string_shorthand`; the
`string_shorthand_diagnostics` / `StringShorthandInvalid` validation path; and the
`shorthand_wire_variant`/`shorthand_field` template context for pyo3 and magnus.

The internally-tagged UNIT-variant bare-string wrap (xberg-io#132) stays: pyo3 still emits
`{"<tag>": s}` and magnus still emits the `{"<tag>": json_str}` TryConvert fallback.
The pyo3/magnus templates collapse to the plain `serde_tag` branch, and regression
tests assert the xberg-io#132 wrap survives in both backends.

Emit one singleton (class) constructor per data-carrying struct variant of a
magnus data enum, so Ruby callers write `Shape.circle(radius)` /
`Shape.rect(width, height)` instead of building a raw `{ "type" => "circle", ... }`
Hash. Each constructor builds the serde-shaped variant directly
(`Self::Circle { radius }`); the magnus data enum is binding-shaped, so the
parameters use the same types the generated enum declares and no core conversion
is needed.

The Rust function is `_factory_<name>` (registered under the bare snake_case name
via `define_singleton_method`) to avoid colliding with the variant accessor of
the same name. Data enums that gain constructors are now registered as a Ruby
class in `ruby_init`; enums with no qualifying struct variant stay unregistered
and keep round-tripping purely through serde IntoValue/TryConvert.

Unit, tuple, and `binding_excluded` variants are skipped, and a hand-written
`impl` method of the same name suppresses the generated constructor (consumer
wins). Variant selection is shared with the pyo3 path: `collect_variant_constructors`
and `VariantConstructor` in `src/codegen/generators/enums.rs` are lifted to
`pub(crate)` and re-exported (crate-internal) from the generators module as the
second consumer. The internally-tagged unit-variant bare-string fallback
(`{"<tag>": s}`) is untouched.

A tagged data enum lowered to a flat PHP class now exposes a static method
per data-carrying struct variant, so PHP callers write Shape::circle($radius)
instead of hand-building a JSON blob for from_json. Each method sets the
discriminator tag and the variant's flat field(s) directly, reusing the same
flat-field naming, tag value, and param->field conversion the core->binding
From impl uses; ..Default::default() covers the remaining optional fields and
is omitted when the variant covers every flat field.

The Rust fn is _factory_<snake> (exposed to PHP under the camelCase form of the snake_case name)
to avoid colliding with the get_<field> accessor. Unit, tuple, and
binding_excluded variants are skipped, and a hand-written impl method of the
same name suppresses the generated constructor. Reuses collect_variant_constructors
shared with the pyo3/magnus paths.

A tagged data enum with struct variants (the JSON-passthrough shape) now exposes
a constructor per data-carrying variant on its R class env, so R callers write
EmbeddingModelType$preset(name) alongside the existing $default()/$from_json()
instead of hand-rolling a JSON string. Each constructor builds the core variant
directly and .into()s it into the JSON-passthrough wrapper (wrapper-convert
model). DTO fields convert via <field>_core let bindings and extendr-remapped
numerics are cast back to the core type.

The Rust fn is _factory_<snake>; the R wrapper binds it under the bare snake
name. Unit, tuple, and binding_excluded variants are skipped, and a hand-written
impl method of the same name suppresses the generated constructor. Reuses
collect_variant_constructors shared with the pyo3/magnus/php paths, and adds the
reusable gen_call_args_with_let_bindings_json_str_cast_vec per-param helper for
numeric-remapping backends.

A tagged data enum with struct variants (the NifTaggedEnum shape) now exposes a
constructor per data-carrying variant in its generated Elixir module, so callers
write Shape.circle(radius) instead of hand-building the tagged tuple. Each
def <snake>(<params>), do: {:<atom>, %{<field>: <param>, ...}} builds the
{:variant, %{field: value}} form the NifTaggedEnum decoder consumes (the
plain-direct model: no NIF, no core conversion, matching what the existing
encode_<snake> param encoder accepts).

Reserved-word variant/param names are guarded via elixir_safe_param_name /
elixir_safe_atom. Unit, tuple, and binding_excluded variants are skipped, and a
hand-written impl method of the same name suppresses the generated constructor.
Reuses collect_variant_constructors shared with the pyo3/magnus/php/extendr
paths.

Goldziher added a commit that referenced this pull request Jun 25, 2026
…kends

Adds a per-variant factory constructor for every data-carrying variant of an
internally-tagged data enum across the dynamic backends (pyo3, magnus, php,
extendr, rustler), bringing them to parity with the statically-typed backends,
and removes the superseded `string_shorthand` mechanism (#135). Closes #147.

Reviewed: fmt, clippy, and the full test suite pass on the merge result.
@Goldziher Goldziher merged commit 62e1559 into xberg-io:main Jun 25, 2026
4 of 5 checks passed
@Goldziher

Member

Merged into main. Reviewed end-to-end: cargo fmt --check, clippy --all-targets, and the full test suite pass on the merge result.

Two trivial adjustments during merge (neither touches the PR's backend logic):

  • Resolved a CHANGELOG.md conflict by keeping both Unreleased sections (the per-variant-constructor entries plus a concurrent e2e change), ordered Added → Changed → Removed.
  • The base had an unrelated cargo fmt nit and a few project-name mentions in e2e delegation comments that tripped the cli_no_project_special_casing guard; both fixed on main (separate commits), unrelated to this PR.

Thanks @tobocop2 — the draft body was stale; the work was complete across pyo3/magnus/php/extendr/rustler with the string_shorthand removal.

Goldziher added a commit that referenced this pull request Jun 26, 2026
JetBrains Runtime's Panama linker casts every FunctionDescriptor layout to
OfLong internally, so any sub-64-bit integer layout (JAVA_BYTE/SHORT/INT)
threw `ClassCastException: OfIntImpl cannot be cast to OfLong` at NativeLib
class load and corrupted TreeCursor FFM calls
(tree-sitter-language-pack#146, #148).

Promote bool, 8/16/32-bit ints, and enum discriminants to JAVA_LONG across
java_ffi_type, service_api, the enum-discriminant layout, the LAST_ERROR_CODE
descriptor, and the visitor/trait-bridge/registration callback descriptors.
java_ffi_return_cast now emits compound narrowing casts ((int)(long),
(short)(long), (byte)(long)) and the primitive-result templates no longer
double-wrap them. Generated FunctionDescriptors contain zero sub-64-bit
integer layouts; verified via the regenerated tree-sitter-language-pack
bindings (mvn verify passes).
Development

Successfully merging this pull request may close these issues.

pyo3/magnus/php/extendr/rustler: emit per-variant constructors for data enums (parity with the statically-typed backends)
