@rebel-jaebin
Collaborator

@rebel-jaebin rebel-jaebin commented Oct 20, 2025

🚀 Summary of Changes

What does this PR do? What feature, fix, or improvement does it bring?

This PR adds continuous integration infrastructure for vLLM-RBLN on ARC runners. It introduces a GitHub Actions workflow that builds containerized environments and runs automated tests to ensure code quality and correctness across different configurations.


📌 Related Issues / Tickets

  • Resolves #
  • Related to #

✅ Type of Change

  • ✨ Feature (feature)
  • 🧠 Model support (model)
  • 🧬 Core engine changes (core)
  • 🛠 Bug fix (bug-fix)
  • ⚙️ Performance improvement (perf)
  • 🔁 Refactor or code cleanup (refactor)
  • 📄 Documentation (docs)
  • ❓ Other (other): Continuous Integration infrastructure

🧪 How to Test

  1. Run:
     Trigger the workflow:
       1. Open a pull request to trigger the CI automatically
       2. Or manually trigger via workflow_dispatch with custom ref/Python version
  2. Verify output:
     Verify build step:
       1. Check that the container image builds successfully
       2. Verify dependency caching works (subsequent builds should skip if deps unchanged)

     Verify test execution:
       1. Confirm tests run in the containerized environment
       2. Check test results
  3. Edge case tested: ...
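
The manual trigger described above could be issued from the GitHub CLI. This is a minimal sketch, assuming the workflow file is `rbln_arc_ci.yaml`; the helper `dispatch_ci`, the `DRY_RUN` switch, and the input name `python_version` are hypothetical and should be checked against the workflow's `inputs:` block.

```shell
#!/bin/sh
# Sketch only: trigger the CI workflow manually with a custom ref and Python version.
# "python_version" is a hypothetical input name, not confirmed by the PR.
dispatch_ci() {
    ref="$1"
    py="$2"
    # When DRY_RUN is set and non-empty, the command is printed instead of executed.
    ${DRY_RUN:+echo} gh workflow run rbln_arc_ci.yaml --ref "$ref" -f "python_version=$py"
}

# Example: DRY_RUN=1; dispatch_ci feature/arc-ci 3.12
```

Printing the command first makes it easy to confirm the ref and inputs before starting a run on the ARC runners.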

📸 Screenshots / Logs (if applicable)


📋 Checklist

  • PR title follows Conventional Commits format
  • This PR is linked to an existing issue
  • The test method is described, and the expected result is clearly stated
  • Relevant documentation has been updated (if applicable)

💬 Notes


@rebel-jaebin rebel-jaebin self-assigned this Oct 21, 2025
@rebel-jiwoopark rebel-jiwoopark changed the base branch from main to dev October 21, 2025 04:59
@rebel-jaebin rebel-jaebin force-pushed the feature/arc-ci branch 8 times, most recently from 6098926 to e120860 Compare October 23, 2025 07:58
Copilot AI left a comment

Pull Request Overview

This PR introduces a GitHub Actions CI workflow for vLLM-RBLN that runs on ARC runners with specialized hardware support. The workflow builds a containerized environment with necessary drivers (ATOM, BNXT RDMA) and dependencies, then executes automated tests to validate code changes.

Key changes:

  • Adds a multi-stage Dockerfile for building RBLN-optimized containers with hardware drivers
  • Implements dependency-based image caching to skip rebuilds when dependencies are unchanged
  • Creates a CI workflow with build and test jobs for pull request validation
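
The dependency-based caching listed above could work roughly as follows. This is a minimal sketch, assuming the image tag is derived from a hash of the dependency files; the helper `deps_tag` and the choice of input files are illustrative and not taken from the PR.

```shell
#!/bin/sh
# Sketch only: derive a short, stable image tag from dependency file contents.
# deps_tag is a hypothetical helper; the PR's workflow may compute this differently.
deps_tag() {
    # Concatenate the files in the given order and hash the result.
    cat "$@" | sha256sum | cut -c1-12
}

# Example use in a build step (file names are assumptions):
#   tag="deps-$(deps_tag pyproject.toml Dockerfile.ubi.ci)"
#   Skip the build when an image with this tag already exists in the registry.
```

Because the tag changes only when the hashed files change, an unchanged dependency set maps to an existing image and the rebuild can be skipped.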

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File | Description
.github/workflows/rbln_arc_ci.yaml | Defines the CI pipeline with conditional image building and test execution
Dockerfile.ubi.ci | Multi-stage build for ATOM/BNXT drivers, Python environment, and vLLM-RBLN dependencies
entrypoint.sh | Container initialization script for driver setup and command execution

@rebel-jaebin
Collaborator Author

Thank you for your detailed review. I have addressed your feedback. @rebel-daekyeong

@rebel-jaebin rebel-jaebin force-pushed the feature/arc-ci branch 4 times, most recently from 3214b52 to 730e6ae Compare October 27, 2025 07:46
@rebel-jaebin rebel-jaebin requested a review from dtrifiro October 27, 2025 11:17
@rebel-jaebin
Collaborator Author

Hello @dtrifiro, I have a question regarding our next steps.
After the current PR is merged, what should we focus on next?
Are you planning to create test code for the vllm-rbln repository and open a PR for that?
I'd appreciate your guidance on what areas we should prioritize for support moving forward.
Thank you.

@rebel-jaebin rebel-jaebin force-pushed the feature/arc-ci branch 3 times, most recently from f5234b9 to 5bf7abd Compare November 6, 2025 10:52
@chr15p chr15p left a comment

The Dockerfile looks good. I have some questions, but nothing very serious.


# Install RDMA packages (Oracle Linux repos are already configured in Dockerfile)
dnf makecache \
&& dnf install -y \

Why are we installing all these packages at run time? Every time you run a container, it will try to install them, and it may fail to start if it can't (in a disconnected environment, for example).

Why can't we install them in the Dockerfile itself instead?

Collaborator Author

@rebel-jaebin rebel-jaebin Nov 13, 2025

Because installing the RDMA-related packages in the Dockerfile caused issues where ibv_devices didn’t work properly, I updated the setup so that the packages are installed in entrypoint.sh instead.
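
If the packages can eventually be baked into the image, the entrypoint could verify the expected tools and fail with a clear message instead of calling dnf. The following is a minimal sketch of that fail-fast check; `require_tools` is a hypothetical helper, and the sketch does not address the ibv_devices problem described above.

```shell
#!/bin/sh
# Sketch only: verify tools at startup instead of installing them at run time.
# require_tools is a hypothetical helper, not part of the PR.
require_tools() {
    missing=""
    for tool in "$@"; do
        # command -v succeeds only if the tool can be found on PATH.
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    if [ -n "$missing" ]; then
        echo "missing tools:$missing (install them in the Dockerfile)" >&2
        return 1
    fi
}

# Example: require_tools ibv_devices lspci || exit 1
```

A check like this needs neither network access nor root, so it behaves the same in disconnected environments.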

@@ -0,0 +1,169 @@
ARG BASE_UBI_IMAGE_TAG=9.5

Maybe update this to UBI 9.6?

@chr15p chr15p Nov 13, 2025

I see 9.7 came out today :)
Lots of bug and security fixes get rolled up into new versions, so it's worth updating if you can.

# Prepare host modules and udev triggers
depmod -a "$(uname -r)" 2>/dev/null || true
udevadm control --reload 2>/dev/null || true
udevadm trigger 2>/dev/null || true

Do we need to do these in the entrypoint? If you run the image without root permissions they will fail, and I don't understand why we need them.

Collaborator Author

These are the pieces of code that ensure the RDMA device is immediately usable.

The drivers should be loaded and configured by the operator and the device plugin; vLLM then uses the already-configured devices.

depmod creates a list of driver dependencies, which should be fixed for any given version of the driver and is only used when the driver is loaded. So if we need to run it, it should really be done once by the operator when it loads the driver, rather than every time we start a new instance of vLLM.

Collaborator Author

I will address the points you raised in a follow-up PR.
Thank you for your detailed review.

baseurl=https://yum.oracle.com/repo/OracleLinux/OL9/appstream/x86_64/
enabled=1
gpgcheck=0
EOF

Is there a reason to use OL9 when using a UBI image?

What are the additional packages you need?

The only package that is not available in the default UBI repos is numactl-devel. If you don't need it, you can get rid of the Oracle repos entirely. As an alternative, numactl-libs is available; it includes the runtime numactl libraries, just not the header files to compile against.

Collaborator Author

In the CI stage, we recompile vLLM using the --no-binary option.
Even when building the CPU version, numa.h is required, so I install numactl-devel.

As @chr15p mentioned earlier, this package is not available in the default UBI repositories, so I added an additional repo.

Related error message:

/root/.cache/uv/sdists-v9/pypi/vllm/0.10.2/-ni3cE3hJbPJ1ceUT2JNz/src/csrc/cpu/utils.cpp:2:12: fatal error: numa.h: No such file or directory
          2 |   #include <numa.h>
            |            ^~~~~~~~

https://github.com/vllm-project/vllm/blob/main/csrc/cpu/utils.cpp
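
A quick pre-build check can surface the numa.h failure before the long source build starts. This is a minimal sketch, assuming the header lives under /usr/include once numactl-devel is installed; `has_header` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: check for a C header before starting a source build.
# has_header is a hypothetical helper, not part of the PR.
has_header() {
    header="$1"
    include_dir="${2:-/usr/include}"   # default location is an assumption
    [ -f "$include_dir/$header" ]
}

# Example:
#   has_header numa.h || { echo "numa.h not found; install numactl-devel" >&2; exit 1; }
```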

./configure && \
make clean && \
make && \
make install

What are these Broadcom drivers used for?

Collaborator Author

@rebel-jaebin rebel-jaebin Nov 13, 2025

Broadcom drivers are used for RDMA.

@chr15p chr15p left a comment

Not a problem, I'm just wondering why you structured the Dockerfile this way.

RUN chmod +x /entrypoint.sh
ENTRYPOINT [ "/entrypoint.sh" ]

CMD ["/bin/bash"]

We should avoid running as the root user. Something like this works well for OpenShift:

# setup non-root user for OpenShift
RUN umask 002 && \
    useradd --uid 2000 --gid 0 vllm && \
    mkdir -p /home/vllm && \
    chmod g+rwx /home/vllm
USER 2000

This change could cause permission issues in other locations, so this needs to be validated.

Comment on lines +10 to +19
dnf makecache \
&& dnf install -y \
rdma-core \
librdmacm \
libibverbs \
libibverbs-utils \
infiniband-diags \
pciutils \
kmod \
&& dnf clean all

As @chr15p mentioned, we shouldn't be installing these at runtime. This also requires the container to run as root; see my comment above.

If we need root permissions in order to run depmod (shouldn't the operator be taking care of this though?), we need some mechanism to drop permissions before running vllm as well.
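
One possible mechanism for dropping permissions is sketched below. It assumes `setpriv` from util-linux is present in the image and reuses UID 2000 / GID 0 from the non-root user suggested above; `drop_prefix` is a hypothetical helper, and whether the privileged setup belongs in the entrypoint at all remains an open question in this thread.

```shell
#!/bin/sh
# Sketch only: choose a privilege-dropping command prefix based on the current UID.
# UID 2000 / GID 0 mirror the non-root user suggested in the review;
# the availability of setpriv (util-linux) in the image is an assumption.
drop_prefix() {
    uid="$1"
    if [ "$uid" -eq 0 ]; then
        echo "setpriv --reuid=2000 --regid=0 --clear-groups"
    fi
}

# Example entrypoint tail:
#   prefix=$(drop_prefix "$(id -u)")
#   exec $prefix "$@"    # $prefix is intentionally unquoted so it splits into words
```

When the container already runs as a non-root user, the prefix is empty and the command is executed directly.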

@rebel-jaebin
Collaborator Author

@chr15p @dtrifiro @tsailiming

  • Rebellions will provide an NPU Operator that installs the device driver and related software.
    However, since the NPU Operator is still under development, these components must currently be installed manually.
    Therefore, most of the NPU-related software installation steps currently included in the Dockerfile will eventually be removed.

  • RDMA is an important feature for vLLM-RBLN, which uses RBLN-CCL for performance in DP, TP, PP, EP, and P/D (Prefill/Decode Disaggregation).
    As a result, many code additions were made to ensure RDMA support.

I wanted to share this background context.

It seems the root-privilege issue needs to be fixed.
Thank you.

pyproject.toml Outdated
Comment on lines 45 to 46
"rebel-compiler==0.9.3.dev120+gd2b492be",
"triton-rbln==3.2.0+rbln.git617682d1",
Collaborator

Why do you need these dependencies?

Collaborator Author

We use these dependencies to develop our internal functions.
Additionally, I will add a commit so that these dependencies are installed only when explicitly requested through an option, so the main branch is not affected.

Collaborator

That would be great.
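
The opt-in installation discussed above could be wired into an install step along these lines. This is only a sketch: the extra name `internal` and the variable `WITH_INTERNAL_DEPS` are hypothetical, and the PR may expose the option differently.

```shell
#!/bin/sh
# Sketch only: pick the pip install target based on an opt-in flag.
# The extra name "internal" and WITH_INTERNAL_DEPS are hypothetical names.
install_target() {
    if [ "${WITH_INTERNAL_DEPS:-0}" = "1" ]; then
        echo ".[internal]"
    else
        echo "."
    fi
}

# Example: pip install "$(install_target)"
```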

@rebel-jaebin
Collaborator Author

To all reviewers

The current PR will be merged once CI is approved.
Anything not addressed in the current PR will be handled in a subsequent PR.
Thank you.
