Test validation logic fails due to trailing whitespace inconsistencies in test names

https://github.com/AffineFoundation/affinetes/blob/30f3842995f0c324a4a1404b17217c7e2d2a325c/environments/SWE-bench_Pro-os/env.py#L459

### Problem
The test validation logic in `env.py` (line 459) reports a failure even when all required tests pass, because the expected test names and the names in the actual test results differ in trailing whitespace and trailing quotes.

**Current code:**
```python
passed_tests = {x["name"] for x in output["tests"] if x["status"] == "PASSED"}
test_result = (f2p | p2p) <= passed_tests
```

### Root Cause
The sets `f2p | p2p` and `passed_tests` have the same length but different elements, because some names differ only in a trailing space or a closing double quote:

**Expected tests (from f2p | p2p) - missing trailing quotes/spaces:**
- `'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "week'`
- `'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "day'`
- `'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "off'`
- `'test/database.js | Test database test/database/sorted.js::Sorted Set methods test/database/sorted.js::getSortedSetRange() should work with big arrays (length > 100)'`

**Actual passed tests - have proper quotes/trailing spaces:**
- `'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "day"'`
- `'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "week"'`
- `'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "off"'`
- `'test/database.js | Test database test/database/sorted.js::Sorted Set methods test/database/sorted.js::getSortedSetRange() should work with big arrays (length > 100) '`
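The mismatch is easy to miss when the names are printed normally. A minimal sketch for surfacing it, assuming `f2p | p2p` and `passed_tests` are plain Python sets of strings as in the current code (`diff_test_names` is a hypothetical helper, not part of `env.py`):

```python
def diff_test_names(required, passed):
    """Hypothetical helper: return (missing, unexpected) test names, sorted."""
    missing = sorted(required - passed)      # required but not reported as passed
    unexpected = sorted(passed - required)   # passed but not listed as required
    return missing, unexpected


# Two of the names listed above, copied as-is.
required = {
    'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "week',
    'test/database.js | Test database test/database/sorted.js::Sorted Set methods test/database/sorted.js::getSortedSetRange() should work with big arrays (length > 100)',
}
passed = {
    'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "week"',
    'test/database.js | Test database test/database/sorted.js::Sorted Set methods test/database/sorted.js::getSortedSetRange() should work with big arrays (length > 100) ',
}

missing, unexpected = diff_test_names(required, passed)

# repr() makes trailing spaces and quotes visible in the output.
for name in missing:
    print("missing:   ", repr(name))
for name in unexpected:
    print("unexpected:", repr(name))
```

Both sets have two elements, yet every required name shows up as missing, which is why the subset check fails.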

### Proposed Solution
Compare set lengths instead of using the subset operation, since the whitespace inconsistencies are a dataset issue:

```python
passed_tests = {x["name"] for x in output["tests"] if x["status"] == "PASSED"}
test_result = len(f2p | p2p) == len(passed_tests)
```

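This approach works for the cases observed so far because:
1. The number of required tests matches the number of passed tests
2. The only difference is whitespace and quote formatting, not actual test content
3. It's a pragmatic fix until the dataset can be corrected, although it would also accept a run in which a different set of tests with the same count passed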

### Alternative Solutions
- **Long-term fix:** Update the dataset to ensure consistent test name formatting
- **Stricter fix:** Normalize both sides by stripping trailing whitespace and the trailing quote before comparing with the subset operation
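
The stricter fix could look roughly like the sketch below. It assumes the only discrepancies are the ones listed under Root Cause (a trailing space or a missing closing double quote); `normalize_test_name` and `required_tests_passed` are hypothetical names, not existing functions in `env.py`.

```python
def normalize_test_name(name: str) -> str:
    """Hypothetical helper: strip surrounding whitespace and any trailing double quote."""
    return name.strip().rstrip('"').rstrip()


def required_tests_passed(required, passed) -> bool:
    """Return True if every required name has a normalized match among the passed names."""
    required_norm = {normalize_test_name(n) for n in required}
    passed_norm = {normalize_test_name(n) for n in passed}
    return required_norm <= passed_norm


# Names copied from the Root Cause section.
required = {
    'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "day',
    'test/database.js | Test database test/database/sorted.js::Sorted Set methods test/database/sorted.js::getSortedSetRange() should work with big arrays (length > 100)',
}
passed = {
    'test/user.js | User Digest.getSubscribers should accurately build digest list given ACP default "day"',
    'test/database.js | Test database test/database/sorted.js::Sorted Set methods test/database/sorted.js::getSortedSetRange() should work with big arrays (length > 100) ',
}

test_result = required_tests_passed(required, passed)
print(test_result)
```

Unlike the length comparison, this keeps the subset semantics, so a run where a required test did not pass is still rejected. The trade-off is that two test names differing only by a trailing quote or space would be treated as the same test.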
