-
Notifications
You must be signed in to change notification settings - Fork 5
ci: add GitHub Actions workflow for ARC CI testing #120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
6098926 to
e120860
Compare
173b959 to
6607573
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a GitHub Actions CI workflow for vLLM-RBLN that runs on ARC runners with specialized hardware support. The workflow builds a containerized environment with necessary drivers (ATOM, BNXT RDMA) and dependencies, then executes automated tests to validate code changes.
Key changes:
- Adds a multi-stage Dockerfile for building RBEL-optimized containers with hardware drivers
- Implements dependency-based image caching to skip rebuilds when dependencies are unchanged
- Creates a CI workflow with build and test jobs for pull request validation
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| .github/workflows/rbln_arc_ci.yaml | Defines the CI pipeline with conditional image building and test execution |
| Dockerfile.ubi.ci | Multi-stage build for ATOM/BNXT drivers, Python environment, and vLLM-RBLN dependencies |
| entrypoint.sh | Container initialization script for driver setup and command execution |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
6607573 to
a6e4268
Compare
|
Thank you for your detailed review. I have reflected your feedback. @rebel-daekyeong |
3214b52 to
730e6ae
Compare
|
Hello @dtrifiro, I have a question regarding our next steps. |
f5234b9 to
5bf7abd
Compare
5bf7abd to
4e94400
Compare
2944d95 to
68d606c
Compare
chr15p
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dockerfile looks good, I have some questions but nothing very serious.
|
|
||
| # Install RDMA packages (Oracle Linux repos are already configured in Dockerfile) | ||
| dnf makecache \ | ||
| && dnf install -y \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we installing all these packages at run time? everytime you run a container it will try and install these and maybe fail to run if it can't (if its in a disconnected environment for example)
why can we not install them in the dockerfile itself instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because installing the RDMA-related packages in the Dockerfile caused issues where ibv_devices didn’t work properly, I updated the setup so that the packages are installed in entrypoint.sh instead.
Dockerfile.ubi.ci
Outdated
| @@ -0,0 +1,169 @@ | |||
| ARG BASE_UBI_IMAGE_TAG=9.5 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update this to ubi 9.6 maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see 9.7 came out today :)
There are lots of bug and security fixes that get rolled up into the new versions so its worth updating if you can.
| # Prepare host modules and udev triggers | ||
| depmod -a "$(uname -r)" 2>/dev/null || true | ||
| udevadm control --reload 2>/dev/null || true | ||
| udevadm trigger 2>/dev/null || true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to do these in the entrypoint ? if you run the image without root permissions they will fail and I dont understand why we need them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the pieces of code that ensure the RDMA device is immediately usable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The drivers should be loaded and configured by the operator and the device-plugin, vllm then uses the already configured devices.
depmod is creating a list of driver dependencies which should be fixed for any version of the driver, and is only used when the driver is loaded. So if we need to do it then it should really be done once by the operator when it loads the driver, rather than every time we start a new instance of vllm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will reflect the parts you mentioned in the additional PR.
Thank you for your detailed review.
| baseurl=https://yum.oracle.com/repo/OracleLinux/OL9/appstream/x86_64/ | ||
| enabled=1 | ||
| gpgcheck=0 | ||
| EOF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to use OL9 when using UBI image?
What are the additional packages you need?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the only package that is not available in the default ubi repos is numactl-devel if you dont need this that you can get rid of the oracle repos entirely. As an alternative numactl-libs is available which includes the runtime numactl libraries, just not the header files to compile against.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the CI stage, we recompile vLLM by using the --no-binary option.
Even when building the CPU version, numa.h is required, so I install numactl-devel.
As @chr15p mentioned earlier, this package is not available in the default UBI repository, so I added an additional repo.
related error message
/root/.cache/uv/sdists-v9/pypi/vllm/0.10.2/-ni3cE3hJbPJ1ceUT2JNz/src/csrc/cpu/utils.cpp:2:12: fatal error: numa.h: No such file or directory
2 | #include <numa.h>
| ^~~~~~~~
https://github.com/vllm-project/vllm/blob/main/csrc/cpu/utils.cpp
| ./configure && \ | ||
| make clean && \ | ||
| make && \ | ||
| make install |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are these broadcom drivers used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Broadcom drivers are used for RDMA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a problem I'm just wondering why you structure the dockerfile this way
| RUN chmod +x /entrypoint.sh | ||
| ENTRYPOINT [ "/entrypoint.sh" ] | ||
|
|
||
| CMD ["/bin/bash"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should avoid running as the root user. Something like this works well for openshift:
# setup non-root user for OpenShift
RUN umask 002 && \
useradd --uid 2000 --gid 0 vllm && \
mkdir -p /home/vllm && \
chmod g+rwx /home/vllm
USER 2000This change could cause permission issues in other locations, so this needs to be validated.
| dnf makecache \ | ||
| && dnf install -y \ | ||
| rdma-core \ | ||
| librdmacm \ | ||
| libibverbs \ | ||
| libibverbs-utils \ | ||
| infiniband-diags \ | ||
| pciutils \ | ||
| kmod \ | ||
| && dnf clean all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @chr15p mentioned, we shouldn't be installing these at runtime, also, this requires the container to run as root, see my above comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need root permissions in order to run depmod (shouldn't the operator be taking care of this though?), we need some mechanism to drop permissions before running vllm as well.
I wanted to share this background context. It seems the root-privilege issue needs to be fixed. |
pyproject.toml
Outdated
| "rebel-compiler==0.9.3.dev120+gd2b492be", | ||
| "triton-rbln==3.2.0+rbln.git617682d1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need this depedencies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use these dependencies to develop our internal functions.
Additionally, i will add a commit to ensure that the above dependencies are installed only when given the option to not affect the main branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be great.
|
To all reviewers The current PR will be merged once CI is approved. |
🚀 Summary of Changes
This PR adds continuous integration infrastructure for vLLM-RBLN on ARC runners. It introduces a GitHub Actions workflow that builds containerized environments and runs automated tests to ensure code quality and correctness across different configurations.
📌 Related Issues / Tickets
✅ Type of Change
feature)model)core)bug-fix)perf)refactor)docs)other): Continuous Integration infrastructure🧪 How to Test
...📸 Screenshots / Logs (if applicable)
📋 Checklist
💬 Notes