Refactor data generation return type #177

Open
ealt wants to merge 3 commits into main from refactor/data-gen

@ealt (Collaborator) commented Mar 11, 2026

see docs/plans/refactor-gen-process-returns.md


Note

Medium Risk
This is a breaking API change: generate() and the generator helpers no longer return tuples, so any external call sites must migrate to dict key access. Moderate risk of subtle shape/empty-sentinel regressions in downstream code that assumed previous tuple semantics.

Overview
Refactors generative-process generation APIs to return structured TypedDict payloads instead of tuples. GenerativeProcess.generate() (and IndependentFactoredGenerativeProcess.generate()) now returns GenerateResult with states, observations, and all_states; when return_all_states=False, all_states becomes a pytree-preserving empty sentinel.
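For call sites migrating off tuple unpacking, the change might look roughly like the sketch below. The three key names come from the summary above; the `Any` field types, the `generate_stub` helper, and the plain-list empty sentinel are assumptions made only for illustration and are not the library's implementation.

```python
from typing import Any, TypedDict


class GenerateResult(TypedDict):
    """Hypothetical stand-in for the payload described in the overview."""

    states: Any        # final state after generation (the last carry)
    observations: Any  # generated observation sequence
    all_states: Any    # per-step states, or an empty sentinel


def generate_stub(return_all_states: bool = False) -> GenerateResult:
    """Toy generator used only to show the access pattern (assumption)."""
    per_step_states = [0, 1, 2]
    return GenerateResult(
        states=per_step_states[-1],
        observations=[10, 11, 12],
        # An empty list stands in for the empty sentinel here.
        all_states=per_step_states if return_all_states else [],
    )


# Before the refactor (tuple semantics): states, observations = generate(...)
# After the refactor (dict key access):
result = generate_stub(return_all_states=True)
final_state = result["states"]
observations = result["observations"]
```

Because all three keys are always present, callers can index `result["all_states"]` unconditionally and check for emptiness instead of branching on the length of a returned tuple.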

Unifies generate_data_batch() and generate_data_batch_with_full_history() to both return a single DataBatch with always-present gen_states, inputs, labels, belief_states, and prefix_probabilities (empty (batch, 0) fields when history isn’t requested), and mirrors this in the torch wrapper via TorchDataBatch while converting only token tensors to torch. Updates end-to-end training and test suites to use dict access and adds assertions covering empty sentinel shapes and that states reflects the final scan carry.
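A minimal sketch of the always-present-fields idea, assuming NumPy arrays as a stand-in for the project's JAX arrays. The five field names are taken from the summary above; `make_batch`, `has_history`, the placeholder array contents, and the next-token shift used for `inputs` and `labels` are hypothetical.

```python
from typing import TypedDict

import numpy as np


class DataBatch(TypedDict):
    """Hypothetical mirror of the five fields named in the overview."""

    gen_states: np.ndarray
    inputs: np.ndarray
    labels: np.ndarray
    belief_states: np.ndarray
    prefix_probabilities: np.ndarray


def make_batch(tokens: np.ndarray, with_history: bool) -> DataBatch:
    """Toy builder: every key is present, history fields may be empty."""
    batch_size, seq_len = tokens.shape
    # (batch, 0) sentinel used when history is not requested.
    empty = np.zeros((batch_size, 0), dtype=np.float32)
    if with_history:
        # Placeholder values and shapes, not real belief states.
        beliefs = np.zeros((batch_size, seq_len - 1), dtype=np.float32)
        probs = np.ones((batch_size, seq_len - 1), dtype=np.float32)
    else:
        beliefs, probs = empty, empty
    return DataBatch(
        gen_states=np.zeros((batch_size,), dtype=np.float32),
        inputs=tokens[:, :-1],   # assumed next-token setup
        labels=tokens[:, 1:],
        belief_states=beliefs,
        prefix_probabilities=probs,
    )


def has_history(batch: DataBatch) -> bool:
    """History is detected by shape rather than by a missing key or None."""
    return batch["belief_states"].shape[1] > 0
```

Downstream code that previously checked for a missing value would, under this scheme, check the second dimension of the history fields instead.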

Written by Cursor Bugbot for commit b6a67e8. This will update automatically on new commits.

@ealt ealt marked this pull request as draft March 11, 2026 00:16
@ealt ealt marked this pull request as ready for review March 11, 2026 00:22
@ealt ealt requested a review from casperlchristensen March 11, 2026 00:22

@claude claude bot left a comment

This PR is a well-structured refactor that improves type safety by moving from tuple returns to TypedDicts. The approach is sound and the test coverage is thorough. A few issues need addressing before merge.

@casperlchristensen (Collaborator)

@claude The suggested changes are good -- please add them.

@claude claude bot commented Mar 11, 2026

Claude finished @casperlchristensen's task in 1m 17s.


Applied Review Suggestions

  • Read relevant files
  • Fix imports in torch_generator.py — merged 3 separate import blocks from simplexity.generative_processes.generator into one
  • Fix generate_data_batch to use TorchDataBatch(...) constructor instead of plain dict literal
  • Fix generate_data_batch_with_full_history to use TorchDataBatch(...) constructor instead of plain dict literal
  • Update outdated Returns docstring in generate_data_batch_with_full_history to document all 5 fields including gen_states
  • Fix stale belief_states references in generative_process.py docstring to use all_states
  • Committed and pushed to refactor/data-gen

All 4 review suggestions have been applied in commit 85c4ee1.
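As a rough illustration of the constructor-versus-literal suggestion: a `TypedDict` class called as a constructor builds an ordinary `dict` at runtime, so the change affects readability and static type checking rather than behavior. The `TorchDataBatch` definition below is a hypothetical reconstruction with `Any` field types and list placeholders, not the code in torch_generator.py.

```python
from typing import Any, TypedDict


class TorchDataBatch(TypedDict):
    """Hypothetical reconstruction; field types are assumptions."""

    gen_states: Any
    inputs: Any
    labels: Any
    belief_states: Any
    prefix_probabilities: Any


# Plain dict literal: works, but does not name the intended type.
literal = {
    "gen_states": [0],
    "inputs": [1, 2],
    "labels": [2, 3],
    "belief_states": [],
    "prefix_probabilities": [],
}

# Constructor form: names the type at the call site, so a type checker
# can flag a missing or misspelled key here.
constructed = TorchDataBatch(
    gen_states=[0],
    inputs=[1, 2],
    labels=[2, 3],
    belief_states=[],
    prefix_probabilities=[],
)
```

Both forms produce equal plain dictionaries, so existing key-access code is unaffected by the switch.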

- Merge three separate import blocks from simplexity.generative_processes.generator into one
- Use TorchDataBatch(...) constructor instead of plain dict literals in both generate functions
- Update outdated Returns docstring in generate_data_batch_with_full_history to list all 5 fields
- Fix stale belief_states references in generative_process.py docstring to use all_states

Co-authored-by: Casper Lützhøft Christensen <casperlchristensen@users.noreply.github.com>