AMDGPU: Custom lower fptrunc vectors for f32 -> f16 #141883
Conversation
GFX950+ supports v_cvt_pk_f16_f32. However, the current implementation of vector fptrunc lowering fully scalarizes the vector, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm#139956. This work extends that to wider vectors: instead of fully scalarizing, we split the vector into packs (v2f32 -> v2f16) to ensure the packed conversion can always be generated.
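The core of the change, sketched: on the SelectionDAG path, a wide FP_ROUND is split into <2 x f32> sub-vectors, each rounded to a <2 x f16> pack that selects to v_cvt_pk_f16_f32, then reassembled. Below is a condensed, illustrative version of the SplitFP_ROUNDVectorToPacks helper that appears in the diff further down, not the verbatim code:

SDValue splitToPacks(SDValue Op, SelectionDAG &DAG) {
  // Destination is vNf16 with N even; a single pack needs no splitting.
  EVT DstVT = Op.getValueType();
  unsigned NumElts = DstVT.getVectorNumElements();
  if (NumElts == 2)
    return Op;
  SDValue Src = Op.getOperand(0);
  LLVMContext &Ctx = *DAG.getContext();
  EVT SrcPkVT = EVT::getVectorVT(Ctx, MVT::f32, 2); // <2 x f32> slice
  EVT DstPkVT = EVT::getVectorVT(Ctx, MVT::f16, 2); // <2 x f16> pack
  SDLoc DL(Op);
  SmallVector<SDValue, 8> Packs;
  for (unsigned I = 0; I != NumElts; I += 2) {
    // Peel off two f32 lanes and round them together as one pack.
    SDValue PkSrc = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, SrcPkVT, Src,
                                DAG.getConstant(I, DL, MVT::i32));
    Packs.push_back(DAG.getNode(ISD::FP_ROUND, DL, DstPkVT, PkSrc,
                                DAG.getTargetConstant(0, DL, MVT::i32)));
  }
  return DAG.getNode(ISD::CONCAT_VECTORS, DL, DstVT, Packs);
}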
@llvm/pr-subscribers-backend-amdgpu
Author: Changpeng Fang (changpeng)
Changes: The latest ASICs support the v_cvt_pk_f16_f32 instruction. However, the current implementation of vector fptrunc lowering fully scalarizes the vectors, and the scalar conversions may not always be combined to generate the packed one.
Full diff: https://github.com/llvm/llvm-project/pull/141883.diff
5 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index 177750b639c67..9e9c0e5e18a6d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -1061,9 +1061,10 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,
}
auto &FPTruncActions = getActionDefinitionsBuilder(G_FPTRUNC);
- if (ST.hasCvtPkF16F32Inst())
- FPTruncActions.legalFor({{S32, S64}, {S16, S32}, {V2S16, V2S32}});
- else
+ if (ST.hasCvtPkF16F32Inst()) {
+ FPTruncActions.legalFor({{S32, S64}, {S16, S32}, {V2S16, V2S32}})
+ .customFor({{V4S16, V4S32}, {V8S16, V8S32}});
+ } else
FPTruncActions.legalFor({{S32, S64}, {S16, S32}});
FPTruncActions.scalarize(0).lower();
@@ -2163,6 +2164,8 @@ bool AMDGPULegalizerInfo::legalizeCustom(
case TargetOpcode::G_FMINNUM_IEEE:
case TargetOpcode::G_FMAXNUM_IEEE:
return legalizeMinNumMaxNum(Helper, MI);
+ case TargetOpcode::G_FPTRUNC:
+ return legalizeFPTrunc(Helper, MI, MRI);
case TargetOpcode::G_EXTRACT_VECTOR_ELT:
return legalizeExtractVectorElt(MI, MRI, B);
case TargetOpcode::G_INSERT_VECTOR_ELT:
@@ -2749,6 +2752,20 @@ bool AMDGPULegalizerInfo::legalizeMinNumMaxNum(LegalizerHelper &Helper,
return Helper.lowerFMinNumMaxNum(MI) == LegalizerHelper::Legalized;
}
+bool AMDGPULegalizerInfo::legalizeFPTrunc(LegalizerHelper &Helper,
+ MachineInstr &MI,
+ MachineRegisterInfo &MRI) const {
+ Register DstReg = MI.getOperand(0).getReg();
+ LLT DstTy = MRI.getType(DstReg);
+ assert(DstTy.isVector() && DstTy.getNumElements() > 2);
+ LLT EltTy = DstTy.getElementType();
+ assert(EltTy == S16 && "Only handle vectors of half");
+
+ // Split vector to packs.
+ return Helper.fewerElementsVector(MI, 0, LLT::fixed_vector(2, EltTy)) ==
+ LegalizerHelper::Legalized;
+}
+
bool AMDGPULegalizerInfo::legalizeExtractVectorElt(
MachineInstr &MI, MachineRegisterInfo &MRI,
MachineIRBuilder &B) const {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h
index 1f4e02b0d600a..faab326f53fcf 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h
@@ -56,6 +56,8 @@ class AMDGPULegalizerInfo final : public LegalizerInfo {
bool legalizeFPTOI(MachineInstr &MI, MachineRegisterInfo &MRI,
MachineIRBuilder &B, bool Signed) const;
bool legalizeMinNumMaxNum(LegalizerHelper &Helper, MachineInstr &MI) const;
+ bool legalizeFPTrunc(LegalizerHelper &Helper, MachineInstr &MI,
+ MachineRegisterInfo &MRI) const;
bool legalizeExtractVectorElt(MachineInstr &MI, MachineRegisterInfo &MRI,
MachineIRBuilder &B) const;
bool legalizeInsertVectorElt(MachineInstr &MI, MachineRegisterInfo &MRI,
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 74ca3e43fce3a..8d68e5a0d064a 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -919,8 +919,10 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
setOperationAction(ISD::BUILD_VECTOR, MVT::v2bf16, Legal);
}
- if (Subtarget->hasCvtPkF16F32Inst())
- setOperationAction(ISD::FP_ROUND, MVT::v2f16, Custom);
+ if (Subtarget->hasCvtPkF16F32Inst()) {
+ setOperationAction(ISD::FP_ROUND, {MVT::v2f16, MVT::v4f16, MVT::v8f16},
+ Custom);
+ }
setTargetDAGCombine({ISD::ADD,
ISD::UADDO_CARRY,
@@ -6900,14 +6902,44 @@ SDValue SITargetLowering::getFPExtOrFPRound(SelectionDAG &DAG, SDValue Op,
DAG.getTargetConstant(0, DL, MVT::i32));
}
+SDValue SITargetLowering::SplitFP_ROUNDVectorToPacks(SDValue Op,
+ SelectionDAG &DAG) const {
+ unsigned Opc = Op.getOpcode();
+ EVT DstVT = Op.getValueType();
+ unsigned NumElts = DstVT.getVectorNumElements();
+ assert(NumElts % 2 == 0 && "Only handle vectors of even number of elements");
+ if (NumElts == 2) // already packed.
+ return Op;
+
+ SDValue Src = Op.getOperand(0);
+ EVT SrcVT = Src.getValueType();
+ LLVMContext &Context = *DAG.getContext();
+ EVT SrcPkVT = EVT::getVectorVT(Context, SrcVT.getScalarType(), 2);
+ EVT DstPkVT = EVT::getVectorVT(Context, DstVT.getScalarType(), 2);
+
+ SDLoc DL(Op);
+ SmallVector<SDValue, 16> Packs;
+ for (unsigned Index = 0; Index < NumElts; Index += 2) {
+ SDValue PkSrc = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, SrcPkVT, Src,
+ DAG.getConstant(Index, DL, MVT::i32));
+ SDValue PkDst = DAG.getNode(Opc, DL, DstPkVT, PkSrc,
+ DAG.getTargetConstant(0, DL, MVT::i32));
+ Packs.push_back(PkDst);
+ }
+
+ return DAG.getNode(ISD::CONCAT_VECTORS, DL, DstVT, Packs);
+}
+
SDValue SITargetLowering::lowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const {
SDValue Src = Op.getOperand(0);
EVT SrcVT = Src.getValueType();
EVT DstVT = Op.getValueType();
- if (DstVT == MVT::v2f16) {
+ if (DstVT.isVector() && DstVT.getScalarType() == MVT::f16) {
assert(Subtarget->hasCvtPkF16F32Inst() && "support v_cvt_pk_f16_f32");
- return SrcVT == MVT::v2f32 ? Op : SDValue();
+ if (SrcVT.getScalarType() != MVT::f32)
+ return SDValue();
+ return SplitFP_ROUNDVectorToPacks(Op, DAG);
}
if (SrcVT.getScalarType() != MVT::f64)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h
index c42366a1c04c8..1c5ec0e5a25a1 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -145,6 +145,7 @@ class SITargetLowering final : public AMDGPUTargetLowering {
/// Custom lowering for ISD::FP_ROUND for MVT::f16.
SDValue lowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const;
+ SDValue SplitFP_ROUNDVectorToPacks(SDValue Op, SelectionDAG &DAG) const;
SDValue lowerFMINNUM_FMAXNUM(SDValue Op, SelectionDAG &DAG) const;
SDValue lowerFMINIMUM_FMAXIMUM(SDValue Op, SelectionDAG &DAG) const;
SDValue lowerFLDEXP(SDValue Op, SelectionDAG &DAG) const;
diff --git a/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.math.ll b/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.math.ll
index e5815e96fbe33..b4c8dac3ff518 100644
--- a/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.math.ll
+++ b/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.math.ll
@@ -12,18 +12,124 @@ define <2 x half> @v_test_cvt_v2f32_v2f16(<2 x float> %src) {
ret <2 x half> %res
}
-define half @fptrunc_v2f32_v2f16_then_extract(<2 x float> %src) {
-; GFX950-LABEL: fptrunc_v2f32_v2f16_then_extract:
+define <4 x half> @v_test_cvt_v4f32_v4f16(<4 x float> %src) {
+; GFX950-LABEL: v_test_cvt_v4f32_v4f16:
; GFX950: ; %bb.0:
; GFX950-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX950-NEXT: v_cvt_pk_f16_f32 v0, v0, v1
-; GFX950-NEXT: v_add_f16_sdwa v0, v0, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
+; GFX950-NEXT: v_cvt_pk_f16_f32 v1, v2, v3
+; GFX950-NEXT: s_setpc_b64 s[30:31]
+ %res = fptrunc <4 x float> %src to <4 x half>
+ ret <4 x half> %res
+}
+
+define <8 x half> @v_test_cvt_v8f32_v8f16(<8 x float> %src) {
+; GFX950-LABEL: v_test_cvt_v8f32_v8f16:
+; GFX950: ; %bb.0:
+; GFX950-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX950-NEXT: v_cvt_pk_f16_f32 v0, v0, v1
+; GFX950-NEXT: v_cvt_pk_f16_f32 v1, v2, v3
+; GFX950-NEXT: v_cvt_pk_f16_f32 v2, v4, v5
+; GFX950-NEXT: v_cvt_pk_f16_f32 v3, v6, v7
+; GFX950-NEXT: s_setpc_b64 s[30:31]
+ %res = fptrunc <8 x float> %src to <8 x half>
+ ret <8 x half> %res
+}
+
+define half @fptrunc_v2f32_v2f16_extract_uses(<2 x float> %src) {
+; GFX950-LABEL: fptrunc_v2f32_v2f16_extract_uses:
+; GFX950: ; %bb.0:
+; GFX950-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX950-NEXT: v_cvt_pk_f16_f32 v0, v0, v1
+; GFX950-NEXT: v_add_f16_sdwa v0, v0, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
; GFX950-NEXT: s_setpc_b64 s[30:31]
%vec_half = fptrunc <2 x float> %src to <2 x half>
- %first = extractelement <2 x half> %vec_half, i64 1
- %second = extractelement <2 x half> %vec_half, i64 0
- %res = fadd half %first, %second
- ret half %res
+ %f0 = extractelement <2 x half> %vec_half, i64 0
+ %f1 = extractelement <2 x half> %vec_half, i64 1
+ %rslt = fadd half %f0, %f1
+ ret half %rslt
+}
+
+define half @fptrunc_v4f32_v4f16_extract_uses(<4 x float> %vec_float) {
+; GFX950-SDAG-LABEL: fptrunc_v4f32_v4f16_extract_uses:
+; GFX950-SDAG: ; %bb.0:
+; GFX950-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX950-SDAG-NEXT: v_cvt_pk_f16_f32 v2, v2, v3
+; GFX950-SDAG-NEXT: v_cvt_pk_f16_f32 v0, v0, v1
+; GFX950-SDAG-NEXT: v_add_f16_sdwa v0, v0, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
+; GFX950-SDAG-NEXT: v_add_f16_sdwa v1, v2, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
+; GFX950-SDAG-NEXT: v_add_f16_e32 v0, v0, v1
+; GFX950-SDAG-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX950-GISEL-LABEL: fptrunc_v4f32_v4f16_extract_uses:
+; GFX950-GISEL: ; %bb.0:
+; GFX950-GISEL-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX950-GISEL-NEXT: v_cvt_pk_f16_f32 v0, v0, v1
+; GFX950-GISEL-NEXT: v_cvt_pk_f16_f32 v1, v2, v3
+; GFX950-GISEL-NEXT: v_add_f16_sdwa v0, v0, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
+; GFX950-GISEL-NEXT: v_add_f16_sdwa v1, v1, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
+; GFX950-GISEL-NEXT: v_add_f16_e32 v0, v0, v1
+; GFX950-GISEL-NEXT: s_setpc_b64 s[30:31]
+ %vec_half = fptrunc <4 x float> %vec_float to <4 x half>
+ %f0 = extractelement <4 x half> %vec_half, i64 0
+ %f1 = extractelement <4 x half> %vec_half, i64 1
+ %f2 = extractelement <4 x half> %vec_half, i64 2
+ %f3 = extractelement <4 x half> %vec_half, i64 3
+ %sum0 = fadd half %f0, %f1
+ %sum1 = fadd half %f2, %f3
+ %rslt = fadd half %sum0, %sum1
+ ret half %rslt
+}
+
+define half @fptrunc_v8f32_v8f16_extract_uses(<8 x float> %vec_float) {
+; GFX950-SDAG-LABEL: fptrunc_v8f32_v8f16_extract_uses:
+; GFX950-SDAG: ; %bb.0:
+; GFX950-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX950-SDAG-NEXT: v_cvt_pk_f16_f32 v6, v6, v7
+; GFX950-SDAG-NEXT: v_cvt_pk_f16_f32 v4, v4, v5
+; GFX950-SDAG-NEXT: v_cvt_pk_f16_f32 v2, v2, v3
+; GFX950-SDAG-NEXT: v_cvt_pk_f16_f32 v0, v0, v1
+; GFX950-SDAG-NEXT: v_add_f16_sdwa v0, v0, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
+; GFX950-SDAG-NEXT: v_add_f16_sdwa v1, v2, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
+; GFX950-SDAG-NEXT: v_add_f16_sdwa v2, v4, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
+; GFX950-SDAG-NEXT: v_add_f16_sdwa v3, v6, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
+; GFX950-SDAG-NEXT: v_add_f16_e32 v0, v0, v1
+; GFX950-SDAG-NEXT: v_add_f16_e32 v1, v2, v3
+; GFX950-SDAG-NEXT: v_add_f16_e32 v0, v0, v1
+; GFX950-SDAG-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX950-GISEL-LABEL: fptrunc_v8f32_v8f16_extract_uses:
+; GFX950-GISEL: ; %bb.0:
+; GFX950-GISEL-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX950-GISEL-NEXT: v_cvt_pk_f16_f32 v0, v0, v1
+; GFX950-GISEL-NEXT: v_cvt_pk_f16_f32 v1, v2, v3
+; GFX950-GISEL-NEXT: v_cvt_pk_f16_f32 v2, v4, v5
+; GFX950-GISEL-NEXT: v_cvt_pk_f16_f32 v3, v6, v7
+; GFX950-GISEL-NEXT: v_add_f16_sdwa v0, v0, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
+; GFX950-GISEL-NEXT: v_add_f16_sdwa v1, v1, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
+; GFX950-GISEL-NEXT: v_add_f16_sdwa v2, v2, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
+; GFX950-GISEL-NEXT: v_add_f16_sdwa v3, v3, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
+; GFX950-GISEL-NEXT: v_add_f16_e32 v0, v0, v1
+; GFX950-GISEL-NEXT: v_add_f16_e32 v1, v2, v3
+; GFX950-GISEL-NEXT: v_add_f16_e32 v0, v0, v1
+; GFX950-GISEL-NEXT: s_setpc_b64 s[30:31]
+ %vec_half = fptrunc <8 x float> %vec_float to <8 x half>
+ %f0 = extractelement <8 x half> %vec_half, i64 0
+ %f1 = extractelement <8 x half> %vec_half, i64 1
+ %f2 = extractelement <8 x half> %vec_half, i64 2
+ %f3 = extractelement <8 x half> %vec_half, i64 3
+ %f4 = extractelement <8 x half> %vec_half, i64 4
+ %f5 = extractelement <8 x half> %vec_half, i64 5
+ %f6 = extractelement <8 x half> %vec_half, i64 6
+ %f7 = extractelement <8 x half> %vec_half, i64 7
+ %sum0 = fadd half %f0, %f1
+ %sum1 = fadd half %f2, %f3
+ %sum2 = fadd half %f4, %f5
+ %sum3 = fadd half %f6, %f7
+ %sum4 = fadd half %sum0, %sum1
+ %sum5 = fadd half %sum2, %sum3
+ %rslt = fadd half %sum4, %sum5
+ ret half %rslt
}
define <2 x half> @v_test_cvt_v2f64_v2f16(<2 x double> %src) {
You can test this locally with the following command:

git-clang-format --diff HEAD~1 HEAD --extensions h,cpp -- llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp llvm/lib/Target/AMDGPU/SIISelLowering.cpp llvm/lib/Target/AMDGPU/SIISelLowering.h

View the diff from clang-format here:

diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 1e6eb080b..c62951297 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -920,8 +920,9 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
}
if (Subtarget->hasCvtPkF16F32Inst()) {
- setOperationAction(ISD::FP_ROUND, {MVT::v2f16, MVT::v4f16, MVT::v8f16,
- MVT::v16f16}, Custom);
+ setOperationAction(ISD::FP_ROUND,
+ {MVT::v2f16, MVT::v4f16, MVT::v8f16, MVT::v16f16},
+ Custom);
}
setTargetDAGCombine({ISD::ADD,
@@ -6913,8 +6914,8 @@ SDValue SITargetLowering::splitFP_ROUNDVectorOp(SDValue Op,
SDLoc DL(Op);
unsigned Opc = Op.getOpcode();
SDValue Flags = Op.getOperand(1);
- EVT HalfDstVT = EVT::getVectorVT(*DAG.getContext(), DstVT.getScalarType(),
- NumElts / 2);
+ EVT HalfDstVT =
+ EVT::getVectorVT(*DAG.getContext(), DstVT.getScalarType(), NumElts / 2);
SDValue OpLo = DAG.getNode(Opc, DL, HalfDstVT, Lo, Flags);
SDValue OpHi = DAG.getNode(Opc, DL, HalfDstVT, Hi, Flags);
looks good but please fix the clang-format issue
GFX950+ supports v_cvt_pk_f16_f32. However, the current implementation of vector fptrunc lowering fully scalarizes the vector, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm#139956. This work extends that to wider vectors: instead of fully scalarizing, we split the vector into packs (v2f32 -> v2f16) so the packed conversion can always be generated. Fix clang-format.
NOTE: 1) re-implement the helper function the same way as splitUnaryVectorOp; 2) use fewerElementsIf instead of custom lowering on the GlobalISel path; 3) add checks for 3x and 16x vectors.
NOTE: minor changes.
NOTE: Use .clampMaxNumElements(0, S16, 2)
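A minimal sketch of what the final GlobalISel rule described in these commit notes could look like, assuming the S16/S32/V2S16/V2S32 LLT shorthands already defined in AMDGPULegalizerInfo.cpp (illustrative, not the committed code):

// Narrow any wider G_FPTRUNC result to <2 x s16> pieces; each piece is
// then legal and selects to v_cvt_pk_f16_f32, so no custom hook is needed.
auto &FPTruncActions = getActionDefinitionsBuilder(G_FPTRUNC);
if (ST.hasCvtPkF16F32Inst())
  FPTruncActions.legalFor({{S32, S64}, {S16, S32}, {V2S16, V2S32}})
      .clampMaxNumElements(0, S16, 2);
else
  FPTruncActions.legalFor({{S32, S64}, {S16, S32}});
FPTruncActions.scalarize(0).lower();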
Any further suggestions/comments? Thanks.