Skip to content

Conversation

@tsymalla
Copy link
Contributor

@tsymalla tsymalla commented Dec 16, 2025

If LLVM IR is put into update_test_checks, per-line captures will be generated in the resulting lit tests.
On those lit tests, FileCheck does not support more than 9 captures to back-reference. This usually leads to problems if using lots of metadata on function signatures. A solution to this is to avoid having more than 9 captures by splitting -SAME lines into multiple -SAME lines.

Note: This is a proposal to solving this problem. I would prefer a solution inside FileCheck itself, but it seems like this might be a larger endeavor for an issue that might only affect a small percentage of LLVM clients.
I'll add tests if we clarified on the path we will take to fix this issue.

…ence issues.

If LLVM IR is put into `update_test_checks`, per-line captures will be
generated in the resulting lit tests.
On those lit tests, FileCheck will complain if more than 9
captures to back-reference are being used. This is usually the case if using lots of metadata on function signatures. A solution to this is to avoid having more than 9 captures by splitting -SAME lines into multiple -SAME lines.
@llvmbot
Copy link
Member

llvmbot commented Dec 16, 2025

@llvm/pr-subscribers-testing-tools

Author: Thomas Symalla (tsymalla)

Changes

If LLVM IR is put into update_test_checks, per-line captures will be generated in the resulting lit tests.
On those lit tests, FileCheck will complain if more than 9 captures to back-reference are being used. This is usually the case if using lots of metadata on function signatures. A solution to this is to avoid having more than 9 captures by splitting -SAME lines into multiple -SAME lines.


Full diff: https://github.com/llvm/llvm-project/pull/172452.diff

1 Files Affected:

  • (modified) llvm/utils/UpdateTestChecks/common.py (+36-5)
diff --git a/llvm/utils/UpdateTestChecks/common.py b/llvm/utils/UpdateTestChecks/common.py
index baa0377cd8b81..10abd24ac29dd 100644
--- a/llvm/utils/UpdateTestChecks/common.py
+++ b/llvm/utils/UpdateTestChecks/common.py
@@ -24,7 +24,7 @@
 2: --function-signature is now enabled by default and also checks return
    type/attributes.
 3: Opening parenthesis of function args is kept on the first LABEL line
-   in case arguments are split to a separate SAME line.
+   in case arguments are split to separate SAME lines.
 4: --check-globals now has a third option ('smart'). The others are now called
    'none' and 'all'. 'smart' is the default.
 5: Basic block labels are matched by FileCheck expressions
@@ -2177,7 +2177,8 @@ def add_checks(
                     no_meta_details=ginfo.no_meta_details(),
                 )[0]
             func_name_separator = func_dict[checkprefix][func_name].func_name_separator
-            if "[[" in args_and_sig:
+            BRACKET_PREFIX = "[["
+            if BRACKET_PREFIX in args_and_sig:
                 # Captures in label lines are not supported, thus split into a -LABEL
                 # and a separate -SAME line that contains the arguments with captures.
                 args_and_sig_prefix = ""
@@ -2202,9 +2203,39 @@ def add_checks(
                         func_name_separator,
                     )
                 )
-                output_lines.append(
-                    "%s %s-SAME: %s" % (comment_marker, checkprefix, args_and_sig)
-                )
+                # Split into multiple args_and_sig lines to avoid having more than 9 backreferences which
+                # are currently not supported.
+
+                # Find the start positions of the [[ markers.
+                BRACKET_SUFFIX = "]]"
+                split_marker_positions = [idx for idx in range(len(args_and_sig)) if args_and_sig.startswith(BRACKET_PREFIX, idx)]
+                # Divide the marker array into chunks of size 10 so we can cut off the current substring and put the remainder
+                # into a new line.
+                CHUNK_SIZE = 9
+                split_marker_chunks = [split_marker_positions[idx:idx + CHUNK_SIZE] for idx in range(0, len(split_marker_positions), CHUNK_SIZE)]
+
+                next_start = 0
+                for chunk_idx, chunk in enumerate(split_marker_chunks):
+                    # Find the first closing bracket pair at the end position of this chunk.
+                    if chunk_idx == len(split_marker_chunks) - 1:
+                        # If this is the last chunk, then just use the remainder of the string.
+                        current_substr = args_and_sig[next_start:]
+                    else:
+                        # Take the last element of the current chunk. This is the last occurrence of [[.
+                        last_chunk_idx = chunk[len(chunk) - 1]
+                        # Starting from this index in the substring, take the matching ]].
+                        end_pos = args_and_sig.find(BRACKET_SUFFIX, last_chunk_idx)
+                        end_pos += len(BRACKET_SUFFIX)
+                        # Take the position after since this is usually a closing parenthesis or a white space.
+                        end_pos += 1
+                        # Slice from the start point to the last ]] and put it in the current SAME line. 
+                        current_substr = args_and_sig[next_start:end_pos]
+                        # Start again from the end position.
+                        next_start = end_pos
+
+                    output_lines.append(
+                        "%s %s-SAME: %s" % (comment_marker, checkprefix, current_substr)
+                    )
             else:
                 output_lines.append(
                     check_label_format

@github-actions
Copy link

⚠️ We detected that you are using a GitHub private e-mail address to contribute to the repo.
Please turn off Keep my email addresses private setting in your account.
See LLVM Developer Policy and LLVM Discourse for more information.

@github-actions
Copy link

⚠️ Python code formatter, darker found issues in your code. ⚠️

You can test this locally with the following command:
darker --check --diff -r origin/main...HEAD llvm/utils/UpdateTestChecks/common.py

⚠️
The reproduction instructions above might return results for more than one PR
in a stack if you are using a stacked PR workflow. You can limit the results by
changing origin/main to the base branch/commit you want to compare against.
⚠️

View the diff from darker here.
--- common.py	2025-12-16 10:24:58.000000 +0000
+++ common.py	2025-12-16 10:27:21.459178 +0000
@@ -2206,15 +2206,22 @@
                 # Split into multiple args_and_sig lines to avoid having more than 9 backreferences which
                 # are currently not supported.
 
                 # Find the start positions of the [[ markers.
                 BRACKET_SUFFIX = "]]"
-                split_marker_positions = [idx for idx in range(len(args_and_sig)) if args_and_sig.startswith(BRACKET_PREFIX, idx)]
+                split_marker_positions = [
+                    idx
+                    for idx in range(len(args_and_sig))
+                    if args_and_sig.startswith(BRACKET_PREFIX, idx)
+                ]
                 # Divide the marker array into chunks of size 10 so we can cut off the current substring and put the remainder
                 # into a new line.
                 CHUNK_SIZE = 9
-                split_marker_chunks = [split_marker_positions[idx:idx + CHUNK_SIZE] for idx in range(0, len(split_marker_positions), CHUNK_SIZE)]
+                split_marker_chunks = [
+                    split_marker_positions[idx : idx + CHUNK_SIZE]
+                    for idx in range(0, len(split_marker_positions), CHUNK_SIZE)
+                ]
 
                 next_start = 0
                 for chunk_idx, chunk in enumerate(split_marker_chunks):
                     # Find the first closing bracket pair at the end position of this chunk.
                     if chunk_idx == len(split_marker_chunks) - 1:
@@ -2226,11 +2233,11 @@
                         # Starting from this index in the substring, take the matching ]].
                         end_pos = args_and_sig.find(BRACKET_SUFFIX, last_chunk_idx)
                         end_pos += len(BRACKET_SUFFIX)
                         # Take the position after since this is usually a closing parenthesis or a white space.
                         end_pos += 1
-                        # Slice from the start point to the last ]] and put it in the current SAME line. 
+                        # Slice from the start point to the last ]] and put it in the current SAME line.
                         current_substr = args_and_sig[next_start:end_pos]
                         # Start again from the end position.
                         next_start = end_pos
 
                     output_lines.append(

type/attributes.
3: Opening parenthesis of function args is kept on the first LABEL line
in case arguments are split to a separate SAME line.
in case arguments are split to separate SAME lines.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably needs to be a new version to avoid churn.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case with more than 9 backreferences is probably rare enough to not warrant a version. Though from the PR description, I was unable to understand under which circumstances this occurs.

Copy link
Contributor Author

@tsymalla tsymalla Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FileCheck states:

        // Handle substitution of string variables that were defined earlier on
        // the same line by emitting a backreference.

If we have more than 9 candidates capture groups in a single check line, then we will run into this issue.
It can happen in one function signature if the same metadata node is used among multiple metadata uses and there are more than 9 preceding capture groups (for instance, argument capture groups).
This might be a rare case but we run into it all the time.
It would be nicer to have FileCheck generate named capture groups instead of enumerating the capture groups and limiting them to 9, but it seems like this is not too easily. So, a straight-forward fix to resolve this issue is to split after 9 capture groups when regenerating tests.
See #96886.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the generated check lines look like this:

define void @func(i32 [[group1:[[0-9]+]], ..., i32 [[group9:[[0-9]+]]), !md1 [[MD1:[0-9]+]], !md2 [[MD1]]

We should run into this issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, the problem is a backreference after a long list of unrelated capture groups.

I feel like UTC is not really the right place to address this. This splitting seems like something FileCheck could do internally. Though possibly the easiest thing to do would be to add support for more than 9 backreferences to the regex engine (I think this would require a small change to regcomp.c only).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not that straight-forward but I guess I can rather easily teach FileCheck to emit and handle {n} backref strings. I'll work on this.

CHUNK_SIZE = 9
split_marker_chunks = [split_marker_positions[idx:idx + CHUNK_SIZE] for idx in range(0, len(split_marker_positions), CHUNK_SIZE)]

next_start = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a test showing the new output?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants