copy back skyrl code to top level and delete skyrl-train and skyrl-tx #1137
Conversation
b15bbfb into NovaSky-AI:recover_git_history3
Code Review
This pull request is a large refactoring that moves code from skyrl-train and skyrl-tx subdirectories into the top-level skyrl directory, and deletes the old directories. The changes primarily consist of moving files and updating import paths, which have been done consistently. There are also several improvements, such as bug fixes in shell scripts, better handling of gradient checkpointing in the JAX backend, and lazy initialization of inference engines. I've found one minor issue in the .gitignore file with duplicate entries that should be cleaned up.
```gitignore
uv.lock

# PyInstaller
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Jupyter Notebook
.ipynb_checkpoints

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# MkDocs build output
site/

# IDEs and editors
.idea/
.vscode/

# OS generated files
.DS_Store
Thumbs.db

# Hydra outputs
outputs/

# Local artifacts
tinker.db
uv.lock

# Alembic - don't track pycache
tx/tinker/alembic/__pycache__/

# SQLite databases (tracked in git by default, but ignore if created locally)
*.db
```
```python
    input_ids,
    attention_mask=attention_mask,
    adapter_indices=adapter_indices,
    is_training=True,
)
```
🔴 Removing is_training=True from the jax backend causes unnecessary KV cache allocation during Qwen3 training
During the migration, is_training=True was removed from the _model_forward call in the jax backend. For Qwen3 models (which still use StackedDecoderLayers with the is_training flag), this causes is_training to default to False, making the scan body compute and accumulate KV cache tensors for all layers during training.
Root Cause and Impact
The old code at skyrl-tx/tx/tinker/backends/jax.py:278 passed is_training=True:
```python
output = model(
    input_ids,
    attention_mask=attention_mask,
    adapter_indices=adapter_indices,
    is_training=True,
)
```

The new code at skyrl/backends/jax.py:274-278 omits this flag:
```python
output = model(
    input_ids,
    attention_mask=attention_mask,
    adapter_indices=adapter_indices,
)
```

Qwen3 models still accept and use is_training (see skyrl/tx/models/qwen3.py:349,419), which propagates to StackedDecoderLayers.__call__ (skyrl/tx/layers/stacked.py:269-270). When is_training=False (the default), the scan body does NOT zero out k and v, so the scan accumulates full KV cache tensors for every layer. At skyrl/tx/layers/stacked.py:279-285, a full KVCache is then constructed and returned, wasting GPU memory proportional to num_layers × batch_size × seq_len × num_heads × head_dim × 2.
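The mechanism can be sketched in plain Python. This is a dependency-free stand-in for the scan body described above, not the actual StackedDecoderLayers code (which uses jax.lax.scan and real attention tensors); it only shows how the is_training flag controls whether per-layer k/v outputs accumulate across the scan.

```python
def scan_body(hidden, layer_weight, is_training):
    """Stand-in for one decoder-layer step in the scan: returns the new hidden
    state plus the k/v entries this layer would contribute to the KV cache."""
    hidden = [h * layer_weight for h in hidden]   # placeholder layer math
    k = [0.0] * 8                                 # placeholder key tensor
    v = [0.0] * 8                                 # placeholder value tensor
    if is_training:
        # Empty outputs keep the scan from accumulating a per-layer cache.
        k, v = [], []
    return hidden, (k, v)

def run_layers(x, weights, is_training):
    """Loop standing in for jax.lax.scan over the stacked decoder layers."""
    cached = 0
    for w in weights:
        x, (k, v) = scan_body(x, w, is_training)
        cached += len(k) + len(v)                 # cache elements accumulated
    return x, cached

_, train_cache = run_layers([1.0] * 4, [1.0] * 36, is_training=True)
_, infer_cache = run_layers([1.0] * 4, [1.0] * 36, is_training=False)
print(train_cache, infer_cache)  # 0 576
```

With 36 toy layers, the training path accumulates nothing while the inference path stacks 36 × (8 + 8) = 576 cache elements; in the real model each of those placeholder entries is a full batch × seq_len × heads × head_dim tensor.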
Additionally, since config.gradient_checkpointing still propagates through the Qwen3 model to StackedDecoderLayers (lines 274-275 of stacked.py), the outer jax.checkpoint wrapping _model_forward at skyrl/backends/jax.py:284 results in redundant double gradient checkpointing.
Impact: Significant unnecessary GPU memory consumption during training for Qwen3 models, potentially causing OOM on memory-constrained setups. The training results are still correct since the returned KV cache is unused by the caller.
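To put a rough number on the waste, the memory formula above can be evaluated directly. The model dimensions below are illustrative assumptions for a mid-sized Qwen3-style config, not the actual training setup:

```python
def wasted_kv_cache_bytes(num_layers, batch_size, seq_len,
                          num_kv_heads, head_dim, dtype_bytes=2):
    """Approximate memory held by the unused KV cache; the trailing
    factor of 2 covers the separate key and value tensors."""
    return (num_layers * batch_size * seq_len *
            num_kv_heads * head_dim * 2 * dtype_bytes)

# Illustrative numbers (assumed, not the real config):
# 36 layers, batch 8, 4096-token sequences, 8 KV heads, head_dim 128, bf16.
waste = wasted_kv_cache_bytes(36, 8, 4096, 8, 128, dtype_bytes=2)
print(f"{waste / 2**30:.1f} GiB")  # 4.5 GiB
```

Even at these modest assumed sizes the dead cache costs several GiB per forward pass, which is easily the difference between fitting and OOMing on a memory-constrained device.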
(Refers to lines 274-278)