Improve preg_split() function ReturnType #3757

Open
wants to merge 33 commits into
base: 2.1.x

@malsuke malsuke commented Dec 26, 2024

I have further extended the extension for the preg_split() function.

The extension can now infer a more specific return type for preg_split() in the following cases:

  • When the regular expression passed as the $pattern argument is invalid.
  • When the $pattern and $subject arguments are constants.
  • When the string passed as the $subject argument is a non-empty-string.
  • When the $flags argument is set to one or more of the following: PREG_SPLIT_OFFSET_CAPTURE, PREG_SPLIT_NO_EMPTY, or PREG_SPLIT_DELIM_CAPTURE.

The detailed cases are covered by the test cases.
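
To illustrate why the flags matter for the inferred type, here is a small runtime sketch of how each flag changes the shape of the array that preg_split() returns. The patterns and subjects are made up for illustration and are not taken from the PR's test suite.

```php
<?php

// No flags: a plain list of strings, possibly containing empty strings.
$plain = preg_split('/,/', 'a,,b');
// ['a', '', 'b']

// PREG_SPLIT_NO_EMPTY: empty pieces are dropped, so every element is non-empty.
$noEmpty = preg_split('/,/', 'a,,b', -1, PREG_SPLIT_NO_EMPTY);
// ['a', 'b']

// PREG_SPLIT_DELIM_CAPTURE: text matched by capture groups is returned as well.
$withDelims = preg_split('/(,)/', 'a,b', -1, PREG_SPLIT_DELIM_CAPTURE);
// ['a', ',', 'b']

// PREG_SPLIT_OFFSET_CAPTURE: each element becomes a [string, offset] pair.
$withOffsets = preg_split('/,/', 'a,b', -1, PREG_SPLIT_OFFSET_CAPTURE);
// [['a', 0], ['b', 2]]
```

The last case is the reason the signature needs a second list shape, list<array{string, int<0, max>}>, next to list<string>.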

@malsuke malsuke marked this pull request as draft December 26, 2024 07:53
@malsuke malsuke marked this pull request as ready for review December 26, 2024 08:13
@phpstan-bot
Collaborator

This pull request has been marked as ready for review.

} else {
$flagType = $scope->getType($flagArg->value);
$flags = $flagType->getConstantScalarValues();
}
Contributor

If you replace it with the following, the type checks inside the nested loops over constant values are no longer necessary.

$flags = [];
$flagType = $scope->getType($flagArg->value);
foreach ($flagType->getConstantScalarValues() as $flag) {
    if (!is_int($flag)) {
        return new ErrorType();
    }

    $flags[] = $flag;
}

Contributor

resolved 8cb3030

@@ -9081,7 +9081,7 @@
'preg_replace' => ['string|array|null', 'regex'=>'string|array', 'replace'=>'string|array', 'subject'=>'string|array', 'limit='=>'int', '&w_count='=>'int'],
'preg_replace_callback' => ['string|array|null', 'regex'=>'string|array', 'callback'=>'callable(array<int|string, string>):string', 'subject'=>'string|array', 'limit='=>'int', '&w_count='=>'int'],
'preg_replace_callback_array' => ['string|array|null', 'pattern'=>'array<string,callable>', 'subject'=>'string|array', 'limit='=>'int', '&w_count='=>'int'],
-'preg_split' => ['list<string>|false', 'pattern'=>'string', 'subject'=>'string', 'limit='=>'?int', 'flags='=>'int'],
+'preg_split' => ['__benevolent<list<string>|list<array{string, int<0, max>}>|false>', 'pattern'=>'string', 'subject'=>'string', 'limit='=>'?int', 'flags='=>'int'],
Contributor

The other preg_* functions do not use a benevolent union, so I think it would be more consistent not to use one here either.

Author

@VincentLanglet @ondrejmirtes

I understand. I will remove the benevolent union.

On the other hand, I think that preg_split should not return false unless there is an issue with the regular expression. Furthermore, in this PR I have changed the code so that an invalid regular expression causes an error to be returned early in the process.

Therefore, if the regular expression is correct, I am considering not including false in the union.
(In that case, the bug covered by the following test would also be fixed.

public function testBug7554(): void
{
$errors = $this->runAnalyse(__DIR__ . '/data/bug-7554.php');
$this->assertCount(2, $errors);
$this->assertSame(sprintf('Parameter #1 $%s of function count expects array|Countable, list<array<int, int<0, max>|string>>|false given.', PHP_VERSION_ID < 80000 ? 'var' : 'value'), $errors[0]->getMessage());
$this->assertSame(26, $errors[0]->getLine());
$this->assertSame('Cannot access offset int<1, max> on list<array{string, int<0, max>}>|false.', $errors[1]->getMessage());
$this->assertSame(27, $errors[1]->getLine());
}
)

If we drop the benevolent union, would it be fine to remove false as well? I would like to hear your opinion before making any modifications.

Contributor

Like any other preg_* function, I think it can return false if an internal error occurs, for example when:

  • some memory limit is reached
  • the recursion limit is exceeded
  • the subject has an invalid encoding

php.ini also has settings such as pcre.recursion_limit and pcre.backtrack_limit.

So I would keep a non-benevolent union AND false.

If we decide to remove false from the signature, it should be removed from all the preg_* functions. But I don't think we should go this way.
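
As a minimal sketch of the invalid-encoding case (the subject string is made up for illustration): the pattern below is perfectly valid, yet preg_split() still returns false because the /u modifier requires the subject to be valid UTF-8.

```php
<?php

// Valid pattern, but the subject contains a byte that is not valid UTF-8.
$result = preg_split('/,/u', "a,\xFF");

// $result is false; preg_last_error() reports the reason.
$error = preg_last_error();

var_dump($result === false, $error === PREG_BAD_UTF8_ERROR);
```

No warning is raised in this situation, so a caller that ignores false would not notice the failure.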

Author

I understand.
I modified it to keep a non-benevolent union AND false.

37f9b3e

@ondrejmirtes ondrejmirtes changed the base branch from 2.0.x to 2.1.x March 1, 2025 15:20
@ondrejmirtes
Member

@staabm As this is about regexes, can you please review it? Thank you.


if (
count($patternConstantTypes) > 0
&& @preg_match($patternConstantTypes[0]->getValue(), '') === false
Contributor

We usually use Strings::match.

Comment on lines +60 to +61
$subjectType = $scope->getType($subjectArg->value);
$subjectConstantTypes = $subjectType->getConstantStrings();
Contributor

I would move the subject type declaration below the early-returning if in the next block.

$flagType = $scope->getType($flagArg->value);
foreach ($flagType->getConstantScalarValues() as $flag) {
if (!is_int($flag)) {
return new ErrorType();
Contributor

this case is missing a test

$limitType = $scope->getType($limitArg->value);
foreach ($limitType->getConstantScalarValues() as $limit) {
if (!is_int($limit)) {
return new ErrorType();
Contributor

this case is missing a test

foreach ($subjectConstantTypes as $subjectConstantType) {
foreach ($limits as $limit) {
foreach ($flags as $flag) {
$result = @preg_split($patternConstantType->getValue(), $subjectConstantType->getValue(), $limit, $flag);
Contributor

use Strings::split

Author

I think using Strings::split here is not right because the limit is fixed to -1.

} else {
$limitType = $scope->getType($limitArg->value);
foreach ($limitType->getConstantScalarValues() as $limit) {
if (!is_int($limit)) {
Contributor

numeric-string $limit is not an error

https://3v4l.org/JuFHj
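
A minimal sketch of the behaviour described here, assuming the calling file does not declare strict_types=1 (under strict types, passing a string as the limit would throw a TypeError instead):

```php
<?php

// Without strict_types, the numeric string '2' is coerced to the int 2,
// so preg_split() returns at most two pieces.
$parts = preg_split('/,/', 'a,b,c', '2');
// ['a', 'b,c']
```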


if (
count($patternConstantTypes) > 0
&& @preg_match($patternConstantTypes[0]->getValue(), '') === false
Contributor

@staabm staabm Mar 2, 2025

this needs to check all patterns not only the first

https://3v4l.org/495b1
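
One way to sketch the requested check in plain PHP is shown below. The helper name allPatternsAreValid() is hypothetical and works on raw strings; in the extension the values would come from the constant string types instead.

```php
<?php

/**
 * Hypothetical helper: returns true only if every pattern compiles.
 *
 * @param list<string> $patterns
 */
function allPatternsAreValid(array $patterns): bool
{
    foreach ($patterns as $pattern) {
        // preg_match() returns false when the pattern cannot be compiled;
        // the @ operator suppresses the accompanying warning.
        if (@preg_match($pattern, '') === false) {
            return false;
        }
    }

    return true;
}

var_dump(allPatternsAreValid(['/a/', '/b+/'])); // both patterns compile
var_dump(allPatternsAreValid(['/a/', '/[/']));  // the second pattern is invalid
```

Checking only the first element would accept the second call, even though one of the possible patterns cannot be compiled.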

} else {
$flagType = $scope->getType($flagArg->value);
foreach ($flagType->getConstantScalarValues() as $flag) {
if (!is_int($flag)) {
Contributor

to be consistent with limit, this might also allow numeric-string

https://3v4l.org/PqFaA

}

-return null;
+return TypeCombinator::union(...$resultTypes);
Contributor

I think we are missing false. In the preg_match inference we decided that the result can be false even if all args are valid and known at static analysis time, because a regex pattern might be super inefficient (or pattern-based attacks might trick the regex engine into returning false).

Contributor

I think the above comment is still true and we are missing the false here

Author

So, does this mean that every possible result of preg_split includes the possibility of false, and therefore we need to add false to the union type?

I had implemented it to return an Error if preg_split returns false, as a warning.
So, does this mean I should include false in all cases, instead of returning Error?

Contributor

@staabm staabm Mar 11, 2025

false is necessary in the return type because the preg_split call can fail at runtime even if we know everything, IIRC.

The current "return ErrorType" could be turned into "return null" in case other rules will already report a phpstan error for the code examples.

Author

I have fixed that in the following commit:
307cf54

Additionally, since handling for the false case is no longer necessary, I have removed if ($result === false).

}
}

if (count($patternConstantTypes) === 0 || count($subjectConstantTypes) === 0) {
Contributor

this if-branch might be factored out into a private method for readability

Comment on lines 152 to 153
if ($result === false) {
continue;
Contributor

If one of the values known at static analysis time makes preg_split return false, we should give up instead of ignoring this fact.
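
The difference between skipping and giving up can be sketched in plain PHP as follows. The helper splitAllOrGiveUp() is hypothetical and uses raw strings rather than PHPStan types; returning null stands in for "fall back to the default return type".

```php
<?php

/**
 * Hypothetical helper: runs preg_split() for every combination of known
 * values and returns null ("give up") as soon as one call fails.
 *
 * @param list<string> $patterns
 * @param list<string> $subjects
 * @return list<list<string>>|null
 */
function splitAllOrGiveUp(array $patterns, array $subjects): ?array
{
    $results = [];
    foreach ($patterns as $pattern) {
        foreach ($subjects as $subject) {
            $result = @preg_split($pattern, $subject);
            if ($result === false) {
                // Do not silently skip the failing combination.
                return null;
            }
            $results[] = $result;
        }
    }

    return $results;
}

var_dump(splitAllOrGiveUp(['/,/'], ['a,b', 'c']));
var_dump(splitAllOrGiveUp(['/,/u'], ['a,b', "\xFF"])); // invalid UTF-8 subject
```

With `continue` instead of `return null`, the second call would report only the result for 'a,b' and hide the fact that one input cannot be split at all.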

@malsuke
Author

malsuke commented Mar 7, 2025

@staabm
Thanks for your review.
I've made revisions based on the feedback. Please review again.

Contributor

@staabm staabm left a comment

"So, I set __benevolent<list<string>|list<array{string, int<0, max>}>|false> to the basic return type."

It seems this is no longer true.

$limits = [-1];
} else {
$limitType = $scope->getType($limitArg->value);
if (!$limitType->isInteger()->yes() && !$limitType->isString()->yes()) {
Contributor

I think this won't work for a limit like 5|'17'. It could work with ->toInteger() beforehand.

The same applies to $flag.

Please add tests for these cases.

Author

@malsuke malsuke Mar 11, 2025

I realized that using toInteger() beforehand wasn't ideal because it would convert arrays, null, and some other inputs to int(0), which is incorrect. So I added a check using isConstantScalarValue() to ensure we can accurately handle values like 5|'14'.
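
The idea can be sketched in plain PHP, without PHPStan's Type API. The helper normalizeIntLikeValues() is hypothetical, and using FILTER_VALIDATE_INT to decide which strings count as integers is an assumption of this sketch, not necessarily what the PR does.

```php
<?php

/**
 * Hypothetical helper: accepts ints and integer-like strings and gives up
 * (returns null) on anything else instead of coercing it to 0.
 *
 * @param list<mixed> $values
 * @return list<int>|null
 */
function normalizeIntLikeValues(array $values): ?array
{
    $ints = [];
    foreach ($values as $value) {
        if (is_int($value)) {
            $ints[] = $value;
            continue;
        }

        // Only strings such as '17' are accepted; '1.5', 'abc', null and arrays are not.
        $int = is_string($value) ? filter_var($value, FILTER_VALIDATE_INT) : false;
        if ($int === false) {
            return null;
        }

        $ints[] = $int;
    }

    return $ints;
}

var_dump(normalizeIntLikeValues([5, '17']));
var_dump(normalizeIntLikeValues([5, null]));
```

A union such as 5|'17' is normalized to two usable limits, while null or an array makes the whole analysis give up rather than being treated as 0.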


$returnInternalValueType = $returnStringType;
if ($flagArg !== null) {
$flagState = $this->bitwiseFlagAnalyser->bitwiseOrContainsConstant($flagArg->value, $scope, 'PREG_SPLIT_OFFSET_CAPTURE');
Contributor

Suggested change
-$flagState = $this->bitwiseFlagAnalyser->bitwiseOrContainsConstant($flagArg->value, $scope, 'PREG_SPLIT_OFFSET_CAPTURE');
+$capturesOffset = $this->bitwiseFlagAnalyser->bitwiseOrContainsConstant($flagArg->value, $scope, 'PREG_SPLIT_OFFSET_CAPTURE');

Comment on lines 106 to 107
$returnNonEmptyStrings = $flagArg !== null && $this->bitwiseFlagAnalyser->bitwiseOrContainsConstant($flagArg->value, $scope, 'PREG_SPLIT_NO_EMPTY')->yes();
if ($returnNonEmptyStrings) {
Contributor

I would inline this variable, which is used only once, to make the code easier to read.
