Skip to content

Conversation

@NWilson
Copy link
Member

@NWilson NWilson commented Sep 30, 2025

With thanks to @IsaacOscar.

This splits out two commits from Isaac's open PR.

I have:

  • Tried to reduce the number of lines of change in the _match.c files, by preserving a little more of the original behaviour. The main change is now simply that we set the subject field when returning NOMATCH.
  • Merged the "different subject" and "different subject length" errors.
  • Simplified the logic around PCRE2_COPY_MATCHED_SUBJECT. Isaac had a system where you could opt to pass in a NULL subject to pcre2_substitute if the match data had a copied subject. However, I don't see a strong need to allow that (in fact, the entire option PCRE2_COPY_MATCHED_SUBJECT has a fairly weak motivation), so from my point of view it's a simpler narrative to just say, "PCRE2_COPY_MATCHED_SUBJECT doesn't affect the documented restrictions of passing in the same subject to pcre2_match and pcre2_substitute". Simpler than adding yet more conditional logic to the documentation of what is and isn't allowed.
  • TODO I shall also cherry-pick over the pcre2test changes
  • TODO I think I want to add a check also to ensure that the match options haven't changed
  • TODO pcre2test help text and man page update
  • TODO pcre2_substitute man page update
  • TODO testdata test - DFA; fast-JIT; no-match; match-error; subject length shrinking; copied-subject with&without change
  • TODO specific unit testing errors - different pointers/offset/options; subject lengthening; NULL-to-nonnull compat
  • TODO finish moving tests from testinputNEW to existing test files

IsaacOscar and others added 6 commits September 30, 2025 09:59
This adds three new error codes to pcre2_substitute when using
PCRE2_SUBSTITUTE_MATCHED:
* PCRE2_ERROR_DIFFERENT_SUBJECT: returned if if the subject pointer you passed is
  different from the prior call to pcre2_match
* PCRE2_ERROR_DIFFERENT_LENGTH: if the computed length differs from the
  pcre2_match call (i.e. after processing PCRE2_ZERO_TERMINATED)
* PCRE2_ERROR_DIFFERENT_OFFSET if the start offset differs from the
  pcre2_match call.

(cherry picked from commit 0d7e6d5)
Specficially, when pcre2_match is called with PCRE2_COPY_MATCHED_SUBJECT,
and then pcre2_substitute is called with PCRE2_SUBSTITUTE_MATCHED,
the subject data saved by the PCRE2_COPY_MATCHED_SUBJECT call will be used,

In addition, the subject to pcre2_substitute can be NULL.
Moreover, the length can also be PCRE2_ZERO_TERMINATE, and the original
length will also be used.

(cherry picked from commit 9651bcf)
The following subject modifiers have been added:
* substitute_subject=<string>
    like replace=<string> but no [...] syntax
    This is for use with substitute_matched, it makes pcre2test use the given string for
    the call to pcre2_substitute (instead of the subject passed to pcre2_match)

(cherry picked from commit 598ad44)
@IsaacOscar
Copy link
Contributor

IsaacOscar commented Oct 1, 2025

  • Tried to reduce the number of lines of change in the _match.c files, by preserving a little more of the original behaviour. The main change is now simply that we set the subject field when returning NOMATCH.

FYI I recall moving some of the PCRE2_COPY_MATCHED_SUBJECT code around to avoid double frees/memory leaks/invalid reads. So make sure all the test cases (including my ones) are passing under the -valgrind option to RunTest.

I also noticed you deleted my original_subject field and instead allow a 0-length subject to change pointer. I suggest not mentioning this strange exception to the normal rule in the documentation.

  • Merged the "different subject" and "different subject length" errors.

My last commit on my pull request also adds a PCRE2_ERROR_BADUTFCAPTURE error, but that should only be possible if you modify the subject, so you might want to change that to PCRE2_ERROR_DIFFSUBSSUBJECT as well. (It can also return normal UTF validity errors, which can similarly be changed to PCRE2_ERROR_DIFFSUBSSUBJECT).

Then the documentation can roughly say:

  1. You should not change change the subject from the preceding call to pcre2_match.
  2. If you do do that anyway, you may get a PCRE2_ERROR_DIFFSUBSSUBJECT error.
  3. If you don't get an error however, what references in the replacement string get substituted with is unspecified, but will not violate any UTF guarantees.
  • Simplified the logic around PCRE2_COPY_MATCHED_SUBJECT. Isaac had a system where you could opt to pass in a NULL subject to pcre2_substitute if the match data had a copied subject. However, I don't see a strong need to allow that (in fact, the entire option PCRE2_COPY_MATCHED_SUBJECT has a fairly weak motivation), so from my point of view it's a simpler narrative to just say, "PCRE2_COPY_MATCHED_SUBJECT doesn't affect the documented restrictions of passing in the same subject to pcre2_match and pcre2_substitute". Simpler than adding yet more conditional logic to the documentation of what is and isn't allowed.

How about a sentence like this:

However, if PCRE2_COPY_MATCHED_SUBJECT was used in the prior call to pcre2_match, then the internally copied subject is used, and the subject and subject_length parameters to pcre2_substitute are ignored.

Otherwise you're just wasting memory requiring the caller of pcre2_substitute to maintain the original version of the subject, when the whole point of PCRE2_COPY_MATCHED_SUBJECT is for you to not have to do that.

  • TODO I think I want to add a check also to ensure that the match options haven't changed

In my main pcre2test commit, I added a substitute_options modifier which I was planning on using to write test cases for this.

However I think I'm missing a few modifiers, so when cherry picking that over, make sure the bit that says:

      PCRE2_ANCHORED | PCRE2_NO_UTF_CHECK | PCRE2_NO_JIT |  PCRE2_ENDANCHORED |
      PCRE2_NOTBOL | PCRE2_NOTEOL | PCRE2_NOTEMPTY | PCRE2_NOTEMPTY_ATSTART

Should also have:

   PCRE2_COPY_MATCHED_SUBJECT | PCRE2_DISABLE_RECURSELOOP_CHECK

And modify #define SUBSTITUTE_OPTIONS_MODSIZE to be 121.

@IsaacOscar
Copy link
Contributor

I noticed your most recent commit, 071b626 seems to prohibit using pcre2_substitute after pcre2_dfa_match. Is there any reason for this? (I recall testing things before and this did work fine).

@NWilson
Copy link
Member Author

NWilson commented Oct 1, 2025

FYI I recall moving some of the PCRE2_COPY_MATCHED_SUBJECT code around to avoid double frees/memory leaks/invalid reads. So make sure all the test cases (including my ones) are passing under the -valgrind option to RunTest.

I have already merged quite a few of your commits to master.

The CI testing does run the tests with -valgrind, so it's not possible for that fail without us noticing immediately.

I also noticed you deleted my original_subject field and instead allow a 0-length subject to change pointer. I suggest not mentioning this strange exception to the normal rule in the documentation.

Good point. I actually deleted subject and renamed original_subject to subject, because I agreed that we did need to record the original subject.

Now that you mention it, I feel mildly guilty about that exception for 0-length subjects. I will replace the length == 0 condition with a stricter one that allows non-equal subject pointers, only if one or both of the pointers is NULL (or something like that).

My last commit on my pull request also adds a PCRE2_ERROR_BADUTFCAPTURE error

I'll get to that! I'm just merging your work commit-by-commit.

To be honest, I'm not sure what the point of PCRE2_COPY_MATCHED_SUBJECT is at all. If the caller wants to copy the subject, then... they can just do that!?! Why PCRE2 has to do the copy is unclear to me.

In my main pcre2test commit, I added a substitute_options modifier which I was planning on using to write test cases for this.

I think I'm actually not going to add pcre2test modifiers for all these invalid edge cases. pcre2test should allow us to exercise all the documented-legal behaviour in PCRE2, and some of the illegal stuff, but there's so much possible illegal behaviour it would be wild if we added modifiers for them all (particularly if that modifier is only going to be used in literally one single test in the whole suite).

Instead I think I am going to move those tests into a dedicated "unittest()" function that exercises specific error conditions.

@IsaacOscar
Copy link
Contributor

To be honest, I'm not sure what the point of PCRE2_COPY_MATCHED_SUBJECT is at all. If the caller wants to copy the subject, then... they can just do that!?! Why PCRE2 has to do the copy is unclear to me.

Agreed. I just think if a feature already exists it shouldn't be arbitrarily limited.

Instead I think I am going to move those tests into a dedicated "unittest()" function that exercises specific error conditions.

That would have been much easier than modifying pcre2test (I spent more time doing that than the actual behaviour changes I made).

@IsaacOscar
Copy link
Contributor

The CI testing does run the tests with -valgrind, so it's not possible for that fail without us noticing immediately.

Is the CI running my new tests? I didn't notice any of your comments modifying the CI or RunTest to include them. As I just lazilly put them in a totally new testdata/testinputNEW file.

@NWilson
Copy link
Member Author

NWilson commented Oct 1, 2025

Is the CI running my new tests? I didn't notice any of your comments modifying the CI or RunTest to include them. As I just lazilly put them in a totally new testdata/testinputNEW file.

I'll move those tests into one of the existing test files. Actually several - we need to test some "FASTJIT" paths and some DFA paths, which we can't do all in the same input file, because the input files are tested conditionally on whether PCRE2 was compiled with the required features.

@NWilson
Copy link
Member Author

NWilson commented Oct 2, 2025

Hi @IsaacOscar, I've finished with this PR I think.

I ended up making quite a few unrelated changes to pcre2test that appear to be unrelated, but turned out to be required or prompted by the other improvements.

I've preserved the tests of yours that I could, given the reduced number of pcre2test modifiers, and moved the remainder into the unittest() function.

Thank you again for your contribution. Please let me know if you are happy with the behaviour, documentation, tests...

@IsaacOscar
Copy link
Contributor

Thank you again for your contribution. Please let me know if you are happy with the behaviour, documentation, tests...

Everything looks great, modulo my above disagreement about PCRE2_COPY_MATCHED_SUBJECT handling. And thanks for adding PCRE2_ERROR_DIFFSUBSPATTERN which I somehow didn't think of.

@NWilson NWilson merged commit ed69a3a into master Oct 3, 2025
35 checks passed
@NWilson NWilson deleted the user/niwilson/subst-subject branch October 3, 2025 08:53
@NWilson
Copy link
Member Author

NWilson commented Oct 3, 2025

Thank you for reviewing and collaborating.

I appreciate why you might want to make PCRE2_COPY_MATCHED_SUBJECT as powerful as possible. In the long run though, I think there are benefits to minimising complex interactions between features to keep the narrative simple ("PCRE2_SUBSTITUTE_MATCHED requires you to pass in the same subject - simple as that").

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants