Skip to content

Destroy unused witness table after hoisting #7009

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 9, 2025
Merged

Conversation

gtong-nv
Copy link
Contributor

@gtong-nv gtong-nv commented May 6, 2025

This PR turn on the debug tests in CI, which catches a validation error for the witness table using unseen inst.
The root cause is due to 0c5eee we create a new witness table but not destroying the original one, and during the IR validation the old one is triggering errors.

Fixes #6933

@gtong-nv gtong-nv requested a review from a team as a code owner May 6, 2025 05:13
@csyonghe
Copy link
Collaborator

csyonghe commented May 6, 2025

What’s the rationale for disabling validation in witness tables?

@gtong-nv gtong-nv marked this pull request as draft May 6, 2025 05:50
@gtong-nv
Copy link
Contributor Author

gtong-nv commented May 6, 2025

What’s the rationale for disabling validation in witness tables?

Sorry, this is not meant to be the final solution, but I'd like to test out if there are other failed tests in CI, because this is the only failed test I can reproduce locally.
For this change, I'm not familair how the witness table works, but I saw errors
internal error 40007: IR validation failed: def must come before use in same block in autodiff tests such as dynamic-dispatch-custom-bwd-derivative.slang, which is caused by
https://github.com/shader-slang/slang/blob/master/source/slang/slang-ir-validate.cpp#L180, when the operandValue in not in the seenInsts

I'm not sure if this is expected, as it seems working fine depiste of this validation error?

@csyonghe
Copy link
Collaborator

csyonghe commented May 6, 2025

IR validation errors like this is usually due to some other code breaking the assumption.

Our IR is designed around the assumption that an operand should be defined in a place that this visible to the inst, that means the operand itself needs to be either a global Inst, or defined in some parent of the inst, or in a basic block that dominates the current Inst.

The only exception we allow is when the Inst is a decoration. If we need to add other exceptions, we need to understand why first. Usually breaking this assumption means there is a bug.

This is similar to fixing other test failures: we usually don't fix test failures by deleting the test, unless we are certain that the test is obsolete.

@gtong-nv
Copy link
Contributor Author

gtong-nv commented May 8, 2025

The real issue for validation error "IR validation failed: def must come before use in same block" should have something to do with Make IRWitnessTable HOISTABLE . Confirmed by reverting that commit.

The error happens whenever we use BackwardDifferentiable with generics types.
Use dynamic-dispatch-custom-bwd-derivative.slang as an exmaple,
when we try to validate the witness table inst with
validateIRInstOperand(IRValidateContext* context, IRInst* inst, IRUse* operandUse)

operandUse is a StructType with name s_bwd_prop_sqr_outter_Intermediates, and it is not in context->seenInsts yet so the validation failed.
I noticed that that StructType (operand) and WitnessTable(inst) have the same block as the parent, whereas all the WitnessTables in passing tests have module as its parent - should all WitnessTables set parent to module after hoisting?

I also noticed that for the failed witness tables are created with

        builder.setInsertAfter(addMethod);
        auto diffTypeIsDiffWitness =
            builder.createWitnessTable(autodiffContext->differentiableInterfaceType, diffType);

and the builder has a Before mode in the insertLoc due to above. Does it mean the witness table is inserted to the list before the StructType (operand) ? Is this expected?

@csyonghe
Copy link
Collaborator

csyonghe commented May 8, 2025

It seems that in this case the struct type and the witness table are both inside a IRGeneric. It is valid for witness tables, functions and types to exist inside a IRBlock of a IRGeneric.

Does the struct type exist when the witness table is created?

@gtong-nv
Copy link
Contributor Author

gtong-nv commented May 8, 2025

Does the struct type exist when the witness table is created?

Yes, the struct type has smaller UI and is created before the witness table.

It seems that in this case the struct type and the witness table are both inside a IRGeneric

Looks like both have the following structs
IRModule -> IRGeneric -> IRBlock ->IRWitnessTable / IRStructType

What concerns me is that
m_insertLoc is AtEnd when the struct type is inserted to the list, but Before when we are call addGlobalValue() for the the witness table.
As a reulst, IRWitnessTable is inserted before the struct type -
in the screenshot below, the value is the the witness table we are creating, and the builder has the IRInsertLoc as shown in the defaultInsertLoc
image

@csyonghe
Copy link
Collaborator

csyonghe commented May 8, 2025

Is the last instruction of the block returning the witness table or returning the struct type?

@csyonghe
Copy link
Collaborator

csyonghe commented May 8, 2025

If it is a bug in creation of that witness table (inserting into the incorrect location), why does this only surface when the hoisting PR is merged? I suspect it is the hoisting logic moving the witness table into the wrong location.

@gtong-nv
Copy link
Contributor Author

gtong-nv commented May 8, 2025

I think it's returnning the struct type. Dumped all children in the block when the valiation error happens.

let %1 : BasicBlock = block

child op: param (10096)
[TypeConstraintDecoration(%IInterface)]
[nameHint("T")]
let %T : type_t = param

child op: param (10099)
let %1 : witness_table_t(%IInterface) = param

child op: specialize (11065)

child op: specialize (11033)

child op: specialize (11031)

child op: specialize (11000)

child op: specialize (10987)

child op: specialize (10947)

child op: witness_table (10925)
witness_table %1 : witness_table_t(%IDifferentiable)(specialize(%2, %T, %3))
{
witness_table_entry(%4,specialize(%2, %T, %3))

    witness_table_entry(%5,%1)

    witness_table_entry(%6,specialize(%7, %T, %3))

    witness_table_entry(%8,specialize(%9, %T, %3))

}

child op: witness_table (10927)
witness_table %1 : witness_table_t(%IDifferentiable)(%2)
{
witness_table_entry(%3,specialize(%4, %T, %5))

    witness_table_entry(%6,%7)

    witness_table_entry(%8,specialize(%9, %T, %5))

    witness_table_entry(%10,specialize(%11, %T, %5))

}

child op: struct (10111)
struct %1 : Type;

child op: return_val (10112)
return_val(%1)

@csyonghe
Copy link
Collaborator

csyonghe commented May 8, 2025

Is this witness table referenced by anything? If not it shouldn’t even be inserted.

@gtong-nv
Copy link
Contributor Author

gtong-nv commented May 8, 2025

If it is a bug in creation of that witness table (inserting into the incorrect location), why does this only surface when the hoisting PR is merged? I suspect it is the hoisting logic moving the witness table into the wrong location.

I reverted the hoisting PR, and looks like the insertion order is the same. But when it comes to validating, the same witness table is using IR_Specialize as the operand, instead of the unseen struct type. I guess it's the hoisting and dedpulication logic that surface the issue.

@gtong-nv
Copy link
Contributor Author

gtong-nv commented May 8, 2025

Debugging a little further, in the TOT with the hoisting logic, we have 2 witness tables in the same block. The first one is using IR_Specialize as operand, but the 2nd one is an unseen struct type (with UID 10111 below) as its opearnd.

======= validating the block =====
-------------- child 0: param (10096)
-------------- child 1: param (10099)
-------------- child 2: specialize (11065)
-------------- child 3: specialize (11033)
-------------- child 4: specialize (11031)
-------------- child 5: specialize (11000)
-------------- child 6: specialize (10987)
-------------- child 7: specialize (10947)
-------------- child 8: witness_table (10925)
-------------- child 9: witness_table (10927)
-------------- child 10: struct (10111)
-------------- child 11: return_val (10112)

In the old version, the 2 witness tables are still created, but only the first one is being validated ( not inserted to the list or removed?)

======= validating the block =====
-------------- child 0: param (10096)
-------------- child 1: param (10099)
-------------- child 2: specialize (11059)
-------------- child 3: specialize (11033)
-------------- child 4: specialize (11031)
-------------- child 5: specialize (11000)
-------------- child 6: specialize (10987)
-------------- child 7: specialize (10947)
-------------- child 8: witness_table (10925)
-------------- child 9: struct (10111)
-------------- child 10: return_val (10112)

I'm trying to find out the logic why we have 2 witness tables after the hositing PR

@gtong-nv gtong-nv changed the title Skip validating witness operands to fix tests in debug build Destroy unused witness table after hoisting May 8, 2025
@gtong-nv gtong-nv added the pr: non-breaking PRs without breaking changes label May 8, 2025
@gtong-nv gtong-nv marked this pull request as ready for review May 8, 2025 23:42
csyonghe
csyonghe previously approved these changes May 8, 2025
jkwak-work
jkwak-work previously approved these changes May 9, 2025
Copy link
Collaborator

@jkwak-work jkwak-work left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job fixing this.

Looks good to me.

@gtong-nv gtong-nv dismissed stale reviews from jkwak-work and csyonghe via caf7ec9 May 9, 2025 05:42
@gtong-nv
Copy link
Contributor Author

gtong-nv commented May 9, 2025

I'm not going to enable the debug test in this PR as there are still some other failures in the CI. I can't reproduce them locally, and I will try to find another way to debug it.

@gtong-nv gtong-nv enabled auto-merge (squash) May 9, 2025 05:44
mkeshavaNV added a commit to mkeshavaNV/slang that referenced this pull request May 9, 2025
This will be reverted because it is handed in
PR shader-slang#7009
Copy link
Collaborator

@jkwak-work jkwak-work left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Let's fix the other debug bugs later.

@gtong-nv gtong-nv merged commit 5e5c08d into master May 9, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr: non-breaking PRs without breaking changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable debug testing on linux and fix issues.
3 participants