-
Notifications
You must be signed in to change notification settings - Fork 15.6k
[update_test_checks] Generate multiple SAME lines to avoid back-reference issues. #172452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ence issues. If LLVM IR is put into `update_test_checks`, per-line captures will be generated in the resulting lit tests. On those lit tests, FileCheck will complain if more than 9 captures to back-reference are being used. This is usually the case if using lots of metadata on function signatures. A solution to this is to avoid having more than 9 captures by splitting -SAME lines into multiple -SAME lines.
|
@llvm/pr-subscribers-testing-tools Author: Thomas Symalla (tsymalla) ChangesIf LLVM IR is put into Full diff: https://github.com/llvm/llvm-project/pull/172452.diff 1 Files Affected:
diff --git a/llvm/utils/UpdateTestChecks/common.py b/llvm/utils/UpdateTestChecks/common.py
index baa0377cd8b81..10abd24ac29dd 100644
--- a/llvm/utils/UpdateTestChecks/common.py
+++ b/llvm/utils/UpdateTestChecks/common.py
@@ -24,7 +24,7 @@
2: --function-signature is now enabled by default and also checks return
type/attributes.
3: Opening parenthesis of function args is kept on the first LABEL line
- in case arguments are split to a separate SAME line.
+ in case arguments are split to separate SAME lines.
4: --check-globals now has a third option ('smart'). The others are now called
'none' and 'all'. 'smart' is the default.
5: Basic block labels are matched by FileCheck expressions
@@ -2177,7 +2177,8 @@ def add_checks(
no_meta_details=ginfo.no_meta_details(),
)[0]
func_name_separator = func_dict[checkprefix][func_name].func_name_separator
- if "[[" in args_and_sig:
+ BRACKET_PREFIX = "[["
+ if BRACKET_PREFIX in args_and_sig:
# Captures in label lines are not supported, thus split into a -LABEL
# and a separate -SAME line that contains the arguments with captures.
args_and_sig_prefix = ""
@@ -2202,9 +2203,39 @@ def add_checks(
func_name_separator,
)
)
- output_lines.append(
- "%s %s-SAME: %s" % (comment_marker, checkprefix, args_and_sig)
- )
+ # Split into multiple args_and_sig lines to avoid having more than 9 backreferences which
+ # are currently not supported.
+
+ # Find the start positions of the [[ markers.
+ BRACKET_SUFFIX = "]]"
+ split_marker_positions = [idx for idx in range(len(args_and_sig)) if args_and_sig.startswith(BRACKET_PREFIX, idx)]
+ # Divide the marker array into chunks of size 10 so we can cut off the current substring and put the remainder
+ # into a new line.
+ CHUNK_SIZE = 9
+ split_marker_chunks = [split_marker_positions[idx:idx + CHUNK_SIZE] for idx in range(0, len(split_marker_positions), CHUNK_SIZE)]
+
+ next_start = 0
+ for chunk_idx, chunk in enumerate(split_marker_chunks):
+ # Find the first closing bracket pair at the end position of this chunk.
+ if chunk_idx == len(split_marker_chunks) - 1:
+ # If this is the last chunk, then just use the remainder of the string.
+ current_substr = args_and_sig[next_start:]
+ else:
+ # Take the last element of the current chunk. This is the last occurrence of [[.
+ last_chunk_idx = chunk[len(chunk) - 1]
+ # Starting from this index in the substring, take the matching ]].
+ end_pos = args_and_sig.find(BRACKET_SUFFIX, last_chunk_idx)
+ end_pos += len(BRACKET_SUFFIX)
+ # Take the position after since this is usually a closing parenthesis or a white space.
+ end_pos += 1
+ # Slice from the start point to the last ]] and put it in the current SAME line.
+ current_substr = args_and_sig[next_start:end_pos]
+ # Start again from the end position.
+ next_start = end_pos
+
+ output_lines.append(
+ "%s %s-SAME: %s" % (comment_marker, checkprefix, current_substr)
+ )
else:
output_lines.append(
check_label_format
|
|
|
You can test this locally with the following command:darker --check --diff -r origin/main...HEAD llvm/utils/UpdateTestChecks/common.py
View the diff from darker here.--- common.py 2025-12-16 10:24:58.000000 +0000
+++ common.py 2025-12-16 10:27:21.459178 +0000
@@ -2206,15 +2206,22 @@
# Split into multiple args_and_sig lines to avoid having more than 9 backreferences which
# are currently not supported.
# Find the start positions of the [[ markers.
BRACKET_SUFFIX = "]]"
- split_marker_positions = [idx for idx in range(len(args_and_sig)) if args_and_sig.startswith(BRACKET_PREFIX, idx)]
+ split_marker_positions = [
+ idx
+ for idx in range(len(args_and_sig))
+ if args_and_sig.startswith(BRACKET_PREFIX, idx)
+ ]
# Divide the marker array into chunks of size 10 so we can cut off the current substring and put the remainder
# into a new line.
CHUNK_SIZE = 9
- split_marker_chunks = [split_marker_positions[idx:idx + CHUNK_SIZE] for idx in range(0, len(split_marker_positions), CHUNK_SIZE)]
+ split_marker_chunks = [
+ split_marker_positions[idx : idx + CHUNK_SIZE]
+ for idx in range(0, len(split_marker_positions), CHUNK_SIZE)
+ ]
next_start = 0
for chunk_idx, chunk in enumerate(split_marker_chunks):
# Find the first closing bracket pair at the end position of this chunk.
if chunk_idx == len(split_marker_chunks) - 1:
@@ -2226,11 +2233,11 @@
# Starting from this index in the substring, take the matching ]].
end_pos = args_and_sig.find(BRACKET_SUFFIX, last_chunk_idx)
end_pos += len(BRACKET_SUFFIX)
# Take the position after since this is usually a closing parenthesis or a white space.
end_pos += 1
- # Slice from the start point to the last ]] and put it in the current SAME line.
+ # Slice from the start point to the last ]] and put it in the current SAME line.
current_substr = args_and_sig[next_start:end_pos]
# Start again from the end position.
next_start = end_pos
output_lines.append(
|
| type/attributes. | ||
| 3: Opening parenthesis of function args is kept on the first LABEL line | ||
| in case arguments are split to a separate SAME line. | ||
| in case arguments are split to separate SAME lines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably needs to be a new version to avoid churn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The case with more than 9 backreferences is probably rare enough to not warrant a version. Though from the PR description, I was unable to understand under which circumstances this occurs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FileCheck states:
// Handle substitution of string variables that were defined earlier on
// the same line by emitting a backreference.
If we have more than 9 candidates capture groups in a single check line, then we will run into this issue.
It can happen in one function signature if the same metadata node is used among multiple metadata uses and there are more than 9 preceding capture groups (for instance, argument capture groups).
This might be a rare case but we run into it all the time.
It would be nicer to have FileCheck generate named capture groups instead of enumerating the capture groups and limiting them to 9, but it seems like this is not too easily. So, a straight-forward fix to resolve this issue is to split after 9 capture groups when regenerating tests.
See #96886.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the generated check lines look like this:
define void @func(i32 [[group1:[[0-9]+]], ..., i32 [[group9:[[0-9]+]]), !md1 [[MD1:[0-9]+]], !md2 [[MD1]]
We should run into this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see, the problem is a backreference after a long list of unrelated capture groups.
I feel like UTC is not really the right place to address this. This splitting seems like something FileCheck could do internally. Though possibly the easiest thing to do would be to add support for more than 9 backreferences to the regex engine (I think this would require a small change to regcomp.c only).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not that straight-forward but I guess I can rather easily teach FileCheck to emit and handle {n} backref strings. I'll work on this.
| CHUNK_SIZE = 9 | ||
| split_marker_chunks = [split_marker_positions[idx:idx + CHUNK_SIZE] for idx in range(0, len(split_marker_positions), CHUNK_SIZE)] | ||
|
|
||
| next_start = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a test showing the new output?
If LLVM IR is put into
update_test_checks, per-line captures will be generated in the resulting lit tests.On those lit tests, FileCheck does not support more than 9 captures to back-reference. This usually leads to problems if using lots of metadata on function signatures. A solution to this is to avoid having more than 9 captures by splitting -SAME lines into multiple -SAME lines.
Note: This is a proposal to solving this problem. I would prefer a solution inside FileCheck itself, but it seems like this might be a larger endeavor for an issue that might only affect a small percentage of LLVM clients.
I'll add tests if we clarified on the path we will take to fix this issue.