pyrefly: fix real bugs surfaced by 0.61 bump#4804
pyrefly: fix real bugs surfaced by 0.61 bump#4804rjpower wants to merge 6 commits intopyrefly/bump-0.61from
Conversation
- v1/cluster/__init__.py: drop "TPUConfig" from __all__ (only TpuConfig is imported; the wrong-case name never existed). - v1/cluster/ray/tpu/__init__.py: drop "TPU_CONFIGS", "TPUConfig", "get_tpu_config" from __all__ (none are defined or imported). - v1/cluster/base.py, v2/types.py: replace Self with class name in five @staticmethod return annotations (Self is invalid on staticmethods); drop the now-unused Self import. - v1/queue/base.py: Queue stores and returns T on both sides, so T_co (covariant) is wrong when used as an input position. Switch to invariant T across Queue and MemoryQueue.
haliax/jax_utils.py: ensure_tuple lives in haliax.util, not haliax.core. iris/cluster/config.py: ControllerDB can be imported at module scope — no circular import — so drop the stale "# noqa: F821" string forward reference. iris/cluster/controller/db.py: the EndpointRegistry forward-ref is a real cycle; guard with TYPE_CHECKING instead of "# noqa: F821". iris/cluster/controller/controller.py: - Widen _building_counts and _inject_reservation_taints to accept list[WorkerSnapshot] (matches Scheduler.create_scheduling_context; WorkerRow already satisfies the Protocol). - _capture_one_profile asserts the provider is not a K8sTaskProvider before calling the TaskProvider Protocol signature with (address, request, timeout_ms). The profile loop is only spawned on the non-K8s branch (see start()), so this narrows for the type checker without behavior change. zephyr/plan.py: MapShardOp.fn takes a ShardInfo, not the old shard_idx/total_shards kwargs (see the reference site at line 237).
inference/openai.py: in _batch_processing_loop, the Exception handler accessed `batch` which is only assigned by queue_get. If the Exception came from the queue.get itself (other than queue.Empty), `batch` was unbound and the handler would crash with UnboundLocalError. Restructure so the queue.Empty / "other" split happens before the work block, mirroring the intent. optim/model_averaging.py: EmaModelAveraging.update and EmaDecaySqrtModelAveraging.update renamed their `model` parameter to `new_model`, which breaks keyword-argument overrides of the abstract base (ModelAveraging.update). Rename back to `model`. utils/py_utils.py: FailSafeJSONEncoder.default had `obj` as the param name but stdlib JSONEncoder.default names it `o`. Rename and alias locally to keep the body unchanged. data/dataset.py, data/sharded_datasource.py, store/cache.py: three classes used a TypeVar that wasn't in scope (class wasn't declared Generic). Widen the input typevars to Any where unused (dataset, sharded_datasource) and add Generic[T] to SerialCacheWriter so its `exemplar: T` attribute resolves.
Dead code:
- Delete lib/marin/src/marin/infra/ package: the __init__.py imports
and re-exports `TpuMonitor` / `start_tpu_monitor_on_head` from a
deleted `tpu_monitor` module, and nothing imports `marin.infra`.
Guaranteed / possible runtime crashes:
- rl/weight_transfer/arrow_flight.py:267: drop a stray debug
`return 123` from a function declared `-> None`; widen
`fetch_server` return to `ServerInfo | None` (caller already
handles None).
- rl/weight_transfer/arrow_flight.py:307: guard against
`self._latest_weight_id is None` before using it as a dict key.
- rl/rollout_storage.py:200: `while time.time() - start_time <
timeout` crashes when `timeout is None` (signature allows it).
Compute an explicit deadline instead.
- datakit/download/wikipedia.py:61: tarfile.extractfile can return
None for non-regular entries; skip those rather than entering `with
None`.
- transform/wikipedia/transform_wikipedia.py:180: fix return
annotation to `str | None` (function already returns None on
filter failure).
- transform/conversation/adapters.py:
- StrEnum members no longer annotated `: str = "..."`.
- `transform_conversation_to_openai_format` returns `list | None`;
guard against best_completion being None before indexing.
- transform/conversation/transform_conversation.py:123: update
annotation to match the callee (already None-checked).
- transform/fasttext/transform.py: add `return True` to
`convert_fasttext_to_dolma_format` (declared `-> bool` but fell
through).
- rl/environments/inference_ctx/inflight/worker.py:215: `generate`
returns list[str], not str.
API / type drift:
- execution/executor.py: replace 6 sites that used `dataclass` (the
decorator) as a type annotation with `Any`.
- cluster/config.py: add _GenerationConfig TypedDict so
`GENERATION_CONFIGS[x]["base_worker"]` is typed consistently.
- utilities/json_encoder.py: `CustomJsonEncoder.default(obj=)` →
`default(o=)` matching stdlib JSONEncoder.default.
Baseline: 169 → 129 diagnostics, file shrank 2031 → 1551 lines. Removed `deprecated = false` from [tool.pyrefly.errors]: the category now has zero unbaselined diagnostics, so the blanket disable is no longer needed.
|
Claude finished @rjpower's task in 3m 28s —— View job PR ReviewReviewed all 30 changed files against
Verified-correct fixes
Nits / suggestions
Test plan gapThe PR notes "Full tests not run locally; rely on CI." For the levanter Bottom lineLGTM with the nits above. The changes are small, well-scoped, and each is justified by the diagnostics. The 169→129 baseline reduction is real progress without smuggling in functional drift. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c96df0be1c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if isinstance(op, Map): | ||
| if op.needs_shard_context: | ||
| stream = op.fn(stream, shard_idx=ctx.shard_idx, total_shards=ctx.total_shards) | ||
| stream = op.fn(stream, ShardInfo(shard_idx=ctx.shard_idx, total_shards=ctx.total_shards)) |
There was a problem hiding this comment.
Pass shard context as keyword args in run_stage
When op.needs_shard_context is true, this now passes a positional ShardInfo object into op.fn, but Map functions created by compose_map expect keyword-only shard_idx/total_shards parameters. Any stage containing MapShardOp will hit TypeError (pipeline() takes 1 positional argument but 2 were given) at runtime, breaking shard-context transforms; keep keyword arguments here or update both call sites/signatures consistently.
Useful? React with 👍 / 👎.
| if TYPE_CHECKING: | ||
| from iris.cluster.controller.endpoint_registry import EndpointRegistry |
There was a problem hiding this comment.
Avoid TYPE_CHECKING import guard in controller DB
This introduces a TYPE_CHECKING-guarded import, which violates the explicit Iris convention in lib/iris/AGENTS.md (“Avoid TYPE_CHECKING. Use real imports. If you hit a cycle, prefer refactoring or use a Protocol at the boundary.”). The same rule is also present in /AGENTS.md; please replace this guard with a structural cycle break rather than adding a forbidden import pattern.
Useful? React with 👍 / 👎.
|
@claude the tests are failing due to the issues noted in the PR comments, pls address. address nits (1) and (2) as well TypeError: compose_map..pipeline() takes 1 positional argument but 2 were given |
|
Claude finished @rjpower's task in 14m 11s —— View job Done — pushed commit
|
zephyr plan.py: the run_stage call to op.fn invokes a compose_map
pipeline (which takes shard_idx/total_shards as kwargs), not a
user-supplied MapShardOp.fn (which takes a positional ShardInfo).
The PR-0 change at line 779 conflated the two and broke
test_deduplicate_with_num_output_shards with
"compose_map.<locals>.pipeline() takes 1 positional argument but
2 were given". Revert to kwargs; add a comment distinguishing the
physical Map.fn call site from the logical MapShardOp.fn call at
line 237.
iris db.py: replace TYPE_CHECKING-guarded forward-ref import of
EndpointRegistry with a local EndpointRegistryProtocol. lib/iris
AGENTS.md explicitly forbids TYPE_CHECKING ("prefer refactoring
or use a Protocol at the boundary"); the real module-level cycle
(db -> endpoint_registry -> db) means a plain import would cycle,
so Protocol at the boundary is the documented fix. The concrete
EndpointRegistry (imported locally inside __init__) continues to
satisfy the Protocol structurally.
arrow_flight.py nit 1: the broad `except Exception` was rewrapping
the typed FlightUnavailableError raised at line 308 as
FlightInternalError, defeating the typed-retry signal. Catch
FlightUnavailableError before the generic handler and re-raise
unchanged.
rollout_storage.py nit 2: inline `start_time` into the `deadline`
computation (the local was only referenced once).
Summary
Stacked on #4801 (PR-base, pyrefly 0.42→0.61 bump). Fixes the set of real
bugs (not pure typing cleanup) surfaced in the Stage-3 cluster triage
memos. Grouped into one commit per subproject plus a final baseline
refresh.
Baseline delta: 169 → 129 diagnostics (−40);
.pyrefly-baseline.jsonshrank from 2031 → 1551 lines.
Removed
deprecated = falsefrom[tool.pyrefly.errors]— the categorynow has zero unbaselined diagnostics under 0.61.
Bugs fixed
fray
v1/cluster/__init__.py:126—__all__lists"TPUConfig", onlyTpuConfigis imported. Downstreamfrom fray.v1.cluster import TPUConfigwould silently AttributeError.v1/cluster/ray/tpu/__init__.py:30,36,45—__all__listsTPU_CONFIGS,TPUConfig,get_tpu_config— none defined orimported. Drop from
__all__.v1/cluster/base.py:{435,443,482},v2/types.py:{508,516}—Selfused as return type on
@staticmethod; meaningless. Replace with theexplicit class name.
v1/queue/base.py:32,56—Queueprotocol usedT_co(covariant) incontravariant input positions. Switch to invariant
T.haliax
jax_utils.py:360—ensure_tupleimported fromhaliax.core; itlives in
haliax.util.iris
cluster/config.py:1218—db: "ControllerDB | None"marked# noqa: F821for a circular import that doesn't actually exist.Import
ControllerDBat module scope, drop the noqa.cluster/controller/db.py:334— theEndpointRegistryforward-ref isa real cycle; use
TYPE_CHECKINGimport instead of# noqa: F821.cluster/controller/controller.py:312,315,690,691— helpers_building_countsand_inject_reservation_taintstyped againstlist[WorkerRow]but the caller passeslist[WorkerSnapshot]. Widenhelpers to
list[WorkerSnapshot](WorkerRow satisfies the Protocol).cluster/controller/controller.py:1546—_capture_one_profileasserts_provideris not aK8sTaskProviderbefore calling theTaskProvider-Protocol signature. The profile loop is only spawned on
the non-K8s branch (see
start()at line 1202), so no behavior change.zephyr
plan.py:779—MapShardOp.fntakesShardInfo, not staleshard_idx=/total_shards=kwargs (see reference at line 237).levanter
inference/openai.py:363— the Exception handler in_batch_processing_loopreferencedbatch, which was unbound ifqueue.get()raised anything other thanqueue.Empty. GuaranteedUnboundLocalErroron the error path. Restructure so the queuewait and the work block are in separate try blocks.
optim/model_averaging.py:42,73—EmaModelAveraging.updateandEmaDecaySqrtModelAveraging.updaterenamed theirmodelparameterto
new_model, breaking keyword-argument overrides of the abstractbase
ModelAveraging.update. Rename back.utils/py_utils.py:129—FailSafeJSONEncoder.default(obj=)vsstdlib
JSONEncoder.default(o=); rename with local alias.data/dataset.py:325,data/sharded_datasource.py:531,store/cache.py:220— three classes used a TypeVar outside the classgeneric scope. Widen to
Anyor addGeneric[T].marin
lib/marin/src/marin/infra/package:__init__.pyimports
TpuMonitor/start_tpu_monitor_on_headfrom a deletedtpu_monitorsubmodule; nothing importsmarin.infra.rl/weight_transfer/arrow_flight.py:267— stray debugreturn 123from a function declared
-> None; widenfetch_serverreturn toServerInfo | None(caller already handles None).rl/weight_transfer/arrow_flight.py:307— guardself._latest_weight_id is Nonebefore using it as a dict key.rl/rollout_storage.py:200—while time.time() - start_time < timeoutcrashes ontimeout is None(signature allows it); computea deadline.
datakit/download/wikipedia.py:61—tarfile.extractfile()returnsNonefor non-regular entries; skip rather thanwith None.transform/wikipedia/transform_wikipedia.py:180—postprocess_contentdeclared
-> str, returnsNoneon filter failure. Annotatestr | None(caller already None-checks).transform/conversation/adapters.py:52-55,91,110— StrEnum memberannotations dropped (invalid);
transform_conversation_to_openai_formatannotated
list | None; guard againstbest_completion is Nonebefore subscripting.
transform/conversation/transform_conversation.py:123— callerannotation updated to match.
transform/fasttext/transform.py— add missingreturn Truetoconvert_fasttext_to_dolma_format(declared-> bool, fell through).rl/environments/inference_ctx/inflight/worker.py:215—generatereturns
list[str], notstr.execution/executor.py:155,634,973,1137,1138,1209— replace 6 siteswhere
dataclass(the decorator) was used as a type annotation withAny.cluster/config.py:317— add_GenerationConfigTypedDict annotationto
GENERATION_CONFIGS.utilities/json_encoder.py:18—CustomJsonEncoder.defaultparamrename.
Deferred
lib/marin/src/marin/rl/scripts/replay_completions.py— the wholescript is stale against the renamed
InferenceServer.create(config, model, tokenizer)API and the slimmed-downInferenceServerConfig.Fixing requires loading a concrete model/tokenizer at script entry,
which is a rewrite beyond PR-0 scope. The script was already
non-functional.
lib/iris/src/iris/cluster/controller/controller.py:2176—self._provider.sync(batches)dispatches acrossTaskProvider.sync(list[DispatchBatch])vsK8sTaskProvider.sync(DirectProviderBatch). K8s path already uses_sync_direct_providerat line 1455; reconciling the Protocol needsiris-owner judgement.
Test plan
./infra/pre-commit.py --all-files— greenuvx pyrefly@0.61.0 check --baseline .pyrefly-baseline.json—0 errors
.pyrefly-baseline.jsonstrictly smaller than parent branch(2031 → 1551 lines; 169 → 129 diagnostics)