Skip to content

[WIP][Driver][SYCL] Enable --offload-arch support for SYCL offloading. #2

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

srividya-sundaram
Copy link
Owner

@srividya-sundaram srividya-sundaram commented Mar 7, 2025

This patch enables Clang to support SYCL offloading to several different architectures such as Intel CPUs, Intel GPUs, NVidia and AMD GPUs using the offload-arch= option.

Using offload-arch= option, users can specify an offloading device architecture for SYCL ( in addition to CUDA, HIP or OpenMP offloading).

The target triple strings are deduced from the arch values passed via offload-arch= option and the corresponding toolchains are constructed.

Example:

clang++   --offload-new-driver  -fsycl  --offload-arch=bdw              // Offload SYCL code to Intel GPU
clang++   --offload-new-driver  -fsycl  --offload-arch=broadwell    // Offload SYCL code to Intel CPU
clang++   --offload-new-driver  -fsycl  --offload-arch=sm_80          // Offload SYCL code to NVidia GPU
clang++   --offload-new-driver  -fsycl  --offload-arch=gfx700         // Offload SYCL code to AMD GPU

@@ -106,6 +106,8 @@ enum class OffloadArch {
GFX90a,
GFX90c,
GFX9_4_GENERIC,
GFX940,
GFX941,
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Add code to enable SYCL offloading to these AMD GPU targets.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These targets have been removed in community, we should follow and not add them in our upstream efforts.

return false;
if(SYCLTriple.isSPIRAOT())
return false;
}
// Check current set of triples to see if the default has already been set.
for (const auto &SYCLTriple : SYCLTriples) {
if (SYCLTriple.getSubArch() == llvm::Triple::NoSubArch &&
SYCLTriple.isSPIROrSPIRV())
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Update this check to SYCLTriple.isSPIRV()

@srividya-sundaram srividya-sundaram marked this pull request as ready for review March 10, 2025 17:18
@srividya-sundaram srividya-sundaram changed the title [Driver][SYCL] Enable --offload-arch support for SYCL offloading. [WIP][Driver][SYCL] Enable --offload-arch support for SYCL offloading. Mar 10, 2025
@srividya-sundaram srividya-sundaram marked this pull request as draft March 10, 2025 17:21
@sarnex
Copy link
Collaborator

sarnex commented Mar 10, 2025

Sorry is this the upstream version of Mike's internal PR?

@mdtoguchi
Copy link
Collaborator

Sorry is this the upstream version of Mike's internal PR?

@sarnex , no the internal PR is for OpenMP enabling of --offload-arch with the new model.

@sarnex
Copy link
Collaborator

sarnex commented Mar 10, 2025

Got it, thx

@srividya-sundaram srividya-sundaram requested a review from bader March 10, 2025 19:51
@srividya-sundaram
Copy link
Owner Author

Sorry is this the upstream version of Mike's internal PR?

This PR is for enabling SYCL offloading to Intel and third-party GPUs and CPUs in the community clang(llvm-project) and a follow up to llvm#117268

@bader
Copy link
Collaborator

bader commented Mar 10, 2025

clang++ --offload-new-driver -fsycl --offload-arch=bdw // Offload SYCL code to Intel GPU
clang++ --offload-new-driver -fsycl --offload-arch=broadwell // Offload SYCL code to Intel CPU

To be honest, I found these names very confusing. As a user, it's hard to understand the difference between bdw and broadwell. Ideally, we should have code names (or some other identifiers) to separate GPU architecture from CPU architecture. For me, bdw sounds like CPU microarchitecture abbreviation.

Do you think it's possible to add igpu- prefix to official GPU architecture name to clearly separate from the CPU architectures (e.g. --offload-arch=igpu-bdw)?
Another option is to add GPU generation code name: bdw-gen8.

Copy link
Collaborator

@sarnex sarnex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initial pass, thanks!

/// Vector of Macros that need to be added to the Host compilation in a
/// SYCL based offloading scenario. These macros are gathered during
/// construction of the device compilations.
mutable std::vector<std::string> SYCLTargetMacroArgs;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i took a quick look and didn't see openmp doing the same thing, is it possible to do what the other offloading languages do?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIC, SYCL target macros are generated during SYCL device compilation and they are passed to both SYCL device and host compilation step.
Example:

clang -###  -fsycl --offload-arch=bdw -nogpulib test.cpp

clang-20   -cc1  "-fsycl-is-device"  -D__SYCL_TARGET_INTEL_GPU_BDW__   .....
clang-20 -cc1    "-fsycl-is-host" "-D__SYCL_TARGET_INTEL_GPU_BDW__"   .......

This is the existing behavior ported from intel/llvm.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These macros mean that we compile for specific target. Typically, they are set by the LLVM backend. We don't have LLVM backend for SPIR target, so we added them in the driver.
Considering that we have SPIR-V backend, we can consider moving this logic to SPIR-V backend, but I'm not sure how community will respond to adding INTEL specific macros to SPIR-V backend.
Maybe we should consider adding Intel's flavor to SPIR-V as AMD folks. They added some macro here: https://github.com/llvm/llvm-project/pull/89796/files#diff-4c95416395669c428da9b77967b9f70792b863f8b32906bbc39692ec0988ab9fR93-R95

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we should have some reasonable success if we add 'limited' number of macros. How many do we expect to add?

Thanks

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have unique values for each GPU target. So each potential GPU we support, we would have a macro that is associated with it. Given the number we currently have here:

SmallString<64> clang::driver::getGenDeviceMacro(StringRef DeviceName) {
the number isn't small. We would have those, and we also have one generic __SYCL_TARGET_INTEL_X86_64__ for CPU. Additionally, the driver is emitting all of the __SYCL_ANY_DEVICE_HAS_*__ and __SYCL_ALL_DEVICES_HAVE_*__ macros. Definitely not 'limited' by any means.

@@ -846,16 +850,25 @@ static llvm::Triple getSYCLDeviceTriple(StringRef TargetArch) {

static bool addSYCLDefaultTriple(Compilation &C,
SmallVectorImpl<llvm::Triple> &SYCLTriples) {

llvm::Triple DefaultTriple = getSYCLDeviceTriple(
C.getDefaultToolChain().getTriple().isArch32Bit() ? "spirv32"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've never seen spirv32 used, does it actually work? can IGC compile it?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. This was originally introduced by @mdtoguchi in this PR

intel/llvm has spir64 as the default target when -fsycl is passed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Way back when - when we still supported 32-bit targets with the DPC++ compiler, the equivalent usage for the device compilation was spir, and this is the natural extension of that when moving to spirv based arch values. If we need to restrict the targets for IGC, it would be good to do it now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has nothing to do with the IGC. IGC compiles SPIR-V. The right question is: "Does LLVM-SPIRV-Translator support LLVM with spirv32 target triple?" According to my understanding, the answer to this question is "Yes".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned IGC because of the data type size differences, but yeah the translator behavior is important too. If the translator/SPIR-V backend work with it, we should support it.

Copy link
Collaborator

@bader bader Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, 32-bit targets were important for some platforms like Android and/or Windows. I don't think SYCL is enabled on such platforms though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks.

@@ -1066,19 +1079,119 @@ void Driver::CreateOffloadingDeviceToolChains(Compilation &C,
// -ffreestanding cannot be used with -fsycl
argSYCLIncompatible(options::OPT_ffreestanding);

// Map of SYCL target triple strings to their corresponding target archs.
// Example: spir64_x86_64 --> SKYLAKEAVX512
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can the example have multiple arches? that should be possible given the map value is a set right?

}
} else if (Triple.getSubArch() == llvm::Triple::SPIRSubArch_x86_64)
Macro = "-D__SYCL_TARGET_INTEL_X86_64__";
if (Macro.size()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
if (Macro.size()) {
if (!Macro.empty()) {

DeviceOffloadingKind == Action::OFK_Cuda) &&
"Only OpenMP or CUDA offloading kinds are supported for NVIDIA GPUs.");
DeviceOffloadingKind == Action::OFK_Cuda || DeviceOffloadingKind == Action::OFK_SYCL) &&
"Only OpenMP or CUDA or SYCL offloading kinds are supported for NVIDIA GPUs.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Only OpenMP or CUDA or SYCL offloading kinds are supported for NVIDIA GPUs.");
"Only OpenMP, CUDA, and SYCL offloading kinds are supported for NVIDIA GPUs.");

{"westmere", SYCLSupportedIntelArchs::WESTMERE},
{"sandybridge", SYCLSupportedIntelArchs::SANDYBRIDGE},
{"ivybridge", SYCLSupportedIntelArchs::IVYBRIDGE},
{"broadwell", SYCLSupportedIntelArchs::BROADWELL},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could automatically generate this with tablegen. Unfortunately, I'm not good at tablegen :)

// Check if the user provided value for --offload-arch is a valid
// SYCL supported Intel AOT target.
SYCLSupportedIntelArchs
StringToOffloadArchSYCL(llvm::StringRef ArchNameAsString);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we dont use the namespace in the later fcn

Suggested change
StringToOffloadArchSYCL(llvm::StringRef ArchNameAsString);
StringToOffloadArchSYCL(StringRef ArchNameAsString);

@@ -797,6 +798,16 @@ static Triple::SubArchType parseSubArch(StringRef SubArchName) {
if (SubArchName == "arm64ec")
return Triple::AArch64SubArch_arm64ec;

if (SubArchName.starts_with("spir")) {
StringRef SubArch(SubArchName);
if (SubArch.consume_front("spir64_") || SubArch.consume_front("spir_")) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should consider if we also want to support spirv64_gen

@bader Opinion?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This leads me to a separate question - given we should aways have an arch value when targeting GPU, can we get away with only using spirv64 as the triple? This would have the tools handle the proper target behavior when the arch value is encountered and not rely on the triple arch/subarch. This transition from old to new model can also provide a clean break from using spir64.

Additional work would need to be done internally for existing usage of spir64 and spir64_gen on build command lines and convert those accordingly.

Copy link
Collaborator

@sarnex sarnex Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is a good question, my understanding is AMD/NVIDIA don't support JIT at all, so they don't have to deal with AOT vs JIT, so we are the first ones.

I think this is a good topic for a discussion with a larger group, maybe the DPCPP technical forum?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW what will the arch be for JIT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recent changes in the community have introduced a 'default' arch value, which would be used for JIT. In the packager, representation would be something like --image=file=file.bc,triple=spirv64-unknown-unknown,arch=generic,kind=sycl. Similarly for AOT, it would look like --image=file=file.bc,triple=spirv64-unknown-unknown,arch=bdw,kind=sycl. Of course, the arch value here is TBD.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, then i definitely like the idea of triple being only spirv64, and the arch specifying the device/specific hw.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, then i definitely like the idea of triple being only spirv64, and the arch specifying the device/specific hw.

May be the 'arch' can be just empty for JIT. Why do we need 'generic'?

Thanks

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generic is the default value that is now filled for arch= for the packager. This was introduced here: llvm#126655

@asudarsa
Copy link

asudarsa commented Apr 4, 2025

clang++ --offload-new-driver -fsycl --offload-arch=bdw // Offload SYCL code to Intel GPU
clang++ --offload-new-driver -fsycl --offload-arch=broadwell // Offload SYCL code to Intel CPU

To be honest, I found these names very confusing. As a user, it's hard to understand the difference between bdw and broadwell. Ideally, we should have code names (or some other identifiers) to separate GPU architecture from CPU architecture. For me, bdw sounds like CPU microarchitecture abbreviation.

Do you think it's possible to add igpu- prefix to official GPU architecture name to clearly separate from the CPU architectures (e.g. --offload-arch=igpu-bdw)? Another option is to add GPU generation code name: bdw-gen8.

+1 on @bader suggestion. Looking at other arches, we see that sm_* is used for NVidia backends, GFX* is used for AMD. it will be nice to have a 'Intel' prefix like 'igpu-'.

One other quick pointer: In a related PR, we are proposing to move OffloadingArch into its own header file. Please take a look: llvm#133194

Thanks

@@ -843,4 +843,14 @@ def warn_missing_include_dirs : Warning<

def err_drv_malformed_warning_suppression_mapping : Error<
"failed to process suppression mapping file '%0': %1">;

def err_drv_sycl_offload_arch_missing_value : Error<
"must pass in an explicit cpu or gpu architecture to '--offload-arch'">;
Copy link

@asudarsa asudarsa Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"must pass in an explicit cpu or gpu architecture to '--offload-arch'">;
"must pass in a valid cpu or gpu architecture string to '--offload-arch'">;

}

/// getSYCLTargetMacroArgs - return the previously gathered macro target args.
llvm::ArrayRef<std::string> getSYCLTargetMacroArgs() const {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just call this getSYCLTargetMacros? Also, why do we need to include '-D' in the macro name? Can we just not add it when including it into a compilation command?
Just a nit....

@@ -833,10 +833,14 @@ Driver::OpenMPRuntimeKind Driver::getOpenMPRuntime(const ArgList &Args) const {

static llvm::Triple getSYCLDeviceTriple(StringRef TargetArch) {
SmallVector<StringRef, 5> SYCLAlias = {"spir", "spir64", "spirv", "spirv32",
"spirv64"};
"spirv64", "spir64_x86_64",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be less confusing to just have spirv, spirv32, spirv64 (for intel targets) here. Also, why is AMD missing here?

Thanks

@@ -1066,19 +1079,119 @@ void Driver::CreateOffloadingDeviceToolChains(Compilation &C,
// -ffreestanding cannot be used with -fsycl
argSYCLIncompatible(options::OPT_ffreestanding);

// Map of SYCL target triple strings to their corresponding target archs.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have an implementation where we have a single triple for all Intel targets (spirv64/spirv32/spirv) and then architecture is specified solely by --offload-arch. I think we can get rid of spirv64_x86_64 and spirv_gen.

Thanks

@@ -46,8 +46,9 @@ void AMDGPUOpenMPToolChain::addClangTargetOptions(
Action::OffloadKind DeviceOffloadingKind) const {
HostTC.addClangTargetOptions(DriverArgs, CC1Args, DeviceOffloadingKind);

assert(DeviceOffloadingKind == Action::OFK_OpenMP &&
"Only OpenMP offloading kinds are supported.");
assert((DeviceOffloadingKind == Action::OFK_OpenMP ||
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are supported in what scenario?

@@ -5220,13 +5231,45 @@ void Clang::ConstructJob(Compilation &C, const JobAction &JA,
// Set O2 optimization level by default
if (!Args.getLastArg(options::OPT_O_Group))
CmdArgs.push_back("-O2");
// Add any predefined macros associated with intel_gpu* type targets
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment (with intel_gpu*) might need an update.

@@ -5220,13 +5231,45 @@ void Clang::ConstructJob(Compilation &C, const JobAction &JA,
// Set O2 optimization level by default
if (!Args.getLastArg(options::OPT_O_Group))
CmdArgs.push_back("-O2");
// Add any predefined macros associated with intel_gpu* type targets
// passed in with -fsycl-targets
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we intend to use -fsycl-targets for passing target information?

Thanks

Comment on lines +5237 to +5239
// addition to the host compilation. There is no dependence connection
// between device and host where we should be able to use the offloading
// arch to add the macro to the host compile.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Might need a rewrite.

@@ -15,6 +15,222 @@ using namespace clang::driver::tools;
using namespace clang;
using namespace llvm::opt;

// Struct that relates an AOT target value with
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be merged with upstream logic. Please see https://github.com/llvm/llvm-project/pull/133194/files

Thanks


// SYCL AOT compilation to Intel CPUs using --offload-arch

// RUN: %clangxx -### --offload-new-driver -fsycl --offload-arch=broadwell %s 2>&1 | \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the --offload-new-driver flag here? I thought it was turned on by default in the upstream SYCL compilation flow.

Thanks

Copy link

@asudarsa asudarsa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good.
Three comments:

  1. I think all the offloadarch related logic should be moved to cuda.hpp (or a offloadarch.hpp which can reside in parallel with cuda.hpp and we can move all offloadarch related logic there)
  2. I think the triple should be just spirv64/spirv32/spirv for Intel targets and arch should be used for specifying AOT targets (can be empty for JIT).
  3. Please use intel specific prefix for Intel archs (e.g. igpu-bdw).

Also added a few minor comments.

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants