Commit eef980d

KangweiZhu authored and darko-marinov committed
Add prStatusCheck.py for automated PR status updates
This script automates checking and updating PR statuses in the iDFlakies dataset ({gr/pr/py}-data.csv). It queries PR status via the GitHub API, updates a test's status from "Opened" to "Accepted" when its PR is merged, and flags closed-but-not-merged PRs as "Unknown" for manual review.

Features:
- Per-file row-range filtering (--prrange, --grrange, --pyrange)
- Ignore list support via ignore.csv
- PR status caching to avoid redundant API calls
- Handles CSV rows with embedded quotes correctly
- Reads GitHub token from .env file at repo root

For detailed documentation, see auto-update-dataset/python/README.md#prstatuscheckpy
1 parent a39e30f commit eef980d

4 files changed: +490 -3 lines changed


.gitignore

Lines changed: 1 addition & 1 deletion
@@ -180,4 +180,4 @@ dmypy.json
180180
.pytype/
181181

182182
# Cython debug symbols
183-
cython_debug/
183+
cython_debug/

auto-update-dataset/python/README.md

Lines changed: 207 additions & 1 deletion
@@ -55,4 +55,210 @@ The script is used to find archived projects within the dataset CSV files and up
5555
[!]Need to Update (Copy the following contents and replace the corresponding line in pr-data.csv)
5656
line_number 6830:
5757
https://github.com/spinn3r/noxy,d53a49421f385c70b5abe7e8cda84ff3a7b59c71,noxy-reverse,com.spinn3r.noxy.reverse.ReverseProxyServiceTest.testRequestMetaForSuccessfulRequest,ID,Opened,https://github.com/spinn3r/noxy/pull/21,RepoDeleted
58-
```
58+
```
59+
60+
## prStatusCheck.py
61+
### Overview
62+
`prStatusCheck.py` is a script for automatically checking and updating pull request statuses in the IDoFT dataset. It queries the GitHub API to retrieve PR status information and updates the corresponding test records.
63+
64+
```bash
65+
python prStatusCheck.py --help
66+
usage: prStatusCheck.py [-h] [--prrange PRRANGE] [--grrange GRRANGE] [--pyrange PYRANGE] [--threads THREADS]
67+
68+
Update PR statuses in CSV files.
69+
70+
options:
71+
-h, --help show this help message and exit
72+
--prrange PRRANGE Range of CSV rows for pr-data.csv (e.g., 100-200). If not specified, processes all rows. Uses actual CSV row
73+
numbers (header=row 1, first data=row 2). Inclusive.
74+
--grrange GRRANGE Range of CSV rows for gr-data.csv (e.g., 100-200). If not specified, processes all rows. Uses actual CSV row
75+
numbers (header=row 1, first data=row 2). Inclusive.
76+
--pyrange PYRANGE Range of CSV rows for py-data.csv (e.g., 100-200). If not specified, processes all rows. Uses actual CSV row
77+
numbers (header=row 1, first data=row 2). Inclusive.
78+
--threads THREADS Number of threads to use for parallel processing
79+
```
80+
81+
### Setup
82+
```bash
83+
cd ~/idoft
84+
cd auto-update-dataset/python
85+
python -m venv .venv
86+
source .venv/bin/activate
87+
pip install -r requirements.txt
88+
```
89+
90+
### Features
91+
92+
#### 1. Query by GitHub API
93+
94+
The script reads the GitHub token from a `.env` file. To use your own token, create one at https://github.com/settings/tokens and save it in a file named `.env` in the `idoft` project root directory. Example:
95+
```
96+
GITHUB_TOKEN=<Your github token here>
97+
```
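
Reading the token can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `prStatusCheck.py`: the helper name `read_env_token` is hypothetical, and it parses only simple `KEY=VALUE` lines with the standard library.

```python
from pathlib import Path


def read_env_token(env_path, key="GITHUB_TOKEN"):
    """Return the value of `key` from a KEY=VALUE file, or None if absent.

    Hypothetical helper for illustration; the real script may parse .env differently.
    """
    path = Path(env_path)
    if not path.is_file():
        return None
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Drop optional surrounding quotes around the value.
            return value.strip().strip('"').strip("'")
    return None
```

Returning `None` instead of raising lets the caller decide whether a missing token is an error.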
98+
99+
#### 2. Per-File Independent Row Range Processing
100+
101+
Specify a row range for each CSV file via command-line arguments.
102+
103+
- **Arguments**:
104+
105+
- `--prrange`: Row range for pr-data.csv (e.g., `3802-3804`)
106+
107+
```bash
108+
python prStatusCheck.py --prrange 3802-3804
109+
2025-12-09 20:44:18,978 - INFO - --- Processing pr-data.csv ---
110+
2025-12-09 20:44:18,978 - INFO - Loading data from local file: $HOME/idoft/pr-data.csv
111+
2025-12-09 20:44:18,991 - INFO - Queued 3 tasks for pr-data.csv.
112+
2025-12-09 20:44:19,397 - INFO - [pr-data.csv] Row 3802: Status changed but could not be determined, remains Opened (https://github.com/apache/tinkerpop/pull/1658. Please check manually.)
113+
2025-12-09 20:44:19,410 - INFO - [pr-data.csv] Row 3803: Status changed but could not be determined, remains Opened (https://github.com/apache/tinkerpop/pull/1658. Please check manually.)
114+
2025-12-09 20:44:19,968 - INFO - [pr-data.csv] Row 3804: Status changed but could not be determined, remains Opened (https://github.com/apache/tinkerpop/pull/1658. Please check manually.)
115+
2025-12-09 20:44:19,968 - INFO - summary for pr-data.csv: 0 statuses updated, 3 changed but need manual check, 0 still open.
116+
2025-12-09 20:44:19,969 - INFO - Manual check log updated for pr-data.csv
117+
```
118+
119+
- `--grrange`: Row range for gr-data.csv (e.g., `107-108`)
120+
121+
```bash
122+
python prStatusCheck.py --grrange 107-108
123+
2025-12-09 20:43:43,978 - INFO - --- Processing gr-data.csv ---
124+
2025-12-09 20:43:43,978 - INFO - Loading data from local file: $HOME/idoft/gr-data.csv
125+
2025-12-09 20:43:43,983 - INFO - Queued 2 tasks for gr-data.csv.
126+
2025-12-09 20:43:44,574 - INFO - [gr-data.csv] Row 107: Status changed but could not be determined, remains Opened (https://github.com/apache/ignite-3/pull/4557. Please check manually.)
127+
2025-12-09 20:43:44,726 - INFO - [gr-data.csv] Row 108: Status remained Opened (https://github.com/apache/ignite-3/pull/4836)
128+
2025-12-09 20:43:44,726 - INFO - summary for gr-data.csv: 0 statuses updated, 1 changed but need manual check, 1 still open.
129+
2025-12-09 20:43:44,726 - INFO - Manual check log updated for gr-data.csv
130+
```
131+
132+
- `--pyrange`: Row range for py-data.csv (e.g., `43-43`)
133+
134+
```bash
135+
python prStatusCheck.py --pyrange 43-43
136+
2025-12-09 20:43:21,857 - INFO - --- Processing py-data.csv ---
137+
2025-12-09 20:43:21,857 - INFO - Loading data from local file: $HOME/idoft/py-data.csv
138+
2025-12-09 20:43:21,862 - INFO - Queued 1 tasks for py-data.csv.
139+
2025-12-09 20:43:22,394 - INFO - [py-data.csv] Row 43: Status changed Opened -> Accepted (https://github.com/jazzband/docopt-ng/pull/20)
140+
2025-12-09 20:43:22,395 - INFO - summary for py-data.csv: 1 statuses updated, 0 changed but need manual check, 0 still open.
141+
2025-12-09 20:43:22,404 - INFO - Accepted log updated for py-data.csv
142+
```
143+
144+
- Or combine all three:
145+
146+
```bash
147+
python prStatusCheck.py --pyrange 43-43 --grrange 107-108 --prrange 3802-3804
148+
2025-12-09 20:44:54,560 - INFO - --- Processing py-data.csv ---
149+
2025-12-09 20:44:54,560 - INFO - Loading data from local file: $HOME/idoft/py-data.csv
150+
2025-12-09 20:44:54,564 - INFO - Queued 1 tasks for py-data.csv.
151+
2025-12-09 20:44:55,044 - INFO - [py-data.csv] Row 43: Status changed Opened -> Accepted (https://github.com/jazzband/docopt-ng/pull/20)
152+
2025-12-09 20:44:55,044 - INFO - summary for py-data.csv: 1 statuses updated, 0 changed but need manual check, 0 still open.
153+
2025-12-09 20:44:55,050 - INFO - Accepted log updated for py-data.csv
154+
2025-12-09 20:44:55,050 - INFO - --- Processing pr-data.csv ---
155+
2025-12-09 20:44:55,050 - INFO - Loading data from local file: $HOME/idoft/pr-data.csv
156+
2025-12-09 20:44:55,081 - INFO - Queued 3 tasks for pr-data.csv.
157+
2025-12-09 20:44:55,493 - INFO - [pr-data.csv] Row 3802: Status changed but could not be determined, remains Opened (https://github.com/apache/tinkerpop/pull/1658. Please check manually.)
158+
2025-12-09 20:44:55,516 - INFO - [pr-data.csv] Row 3804: Status changed but could not be determined, remains Opened (https://github.com/apache/tinkerpop/pull/1658. Please check manually.)
159+
2025-12-09 20:44:55,584 - INFO - [pr-data.csv] Row 3803: Status changed but could not be determined, remains Opened (https://github.com/apache/tinkerpop/pull/1658. Please check manually.)
160+
2025-12-09 20:44:55,584 - INFO - summary for pr-data.csv: 0 statuses updated, 3 changed but need manual check, 0 still open.
161+
2025-12-09 20:44:55,584 - INFO - Manual check log updated for pr-data.csv
162+
2025-12-09 20:44:55,584 - INFO - --- Processing gr-data.csv ---
163+
2025-12-09 20:44:55,584 - INFO - Loading data from local file: $HOME/idoft/gr-data.csv
164+
2025-12-09 20:44:55,595 - INFO - Queued 2 tasks for gr-data.csv.
165+
2025-12-09 20:44:56,177 - INFO - [gr-data.csv] Row 107: Status changed but could not be determined, remains Opened (https://github.com/apache/ignite-3/pull/4557. Please check manually.)
166+
2025-12-09 20:44:56,416 - INFO - [gr-data.csv] Row 108: Status remained Opened (https://github.com/apache/ignite-3/pull/4836)
167+
2025-12-09 20:44:56,417 - INFO - summary for gr-data.csv: 0 statuses updated, 1 changed but need manual check, 1 still open.
168+
2025-12-09 20:44:56,417 - INFO - Manual check log updated for gr-data.csv
169+
```
170+
171+
- **Behavior**:
172+
173+
- If any range is specified: Only files with specified ranges are processed; others are skipped
174+
- If no range is specified: All three files are processed in full
175+
- Row numbers use actual CSV row numbers (header = row 1, first data = row 2)
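
The row-numbering convention above can be sketched as follows. The function names `parse_row_range` and `select_rows` are hypothetical, and the validation rules are assumptions; this only illustrates how an inclusive `START-END` range in CSV row numbers maps onto data rows when the header is row 1.

```python
def parse_row_range(text):
    """Parse 'START-END' (CSV row numbers, inclusive) into a (start, end) tuple."""
    start_text, sep, end_text = text.partition("-")
    if not sep:
        raise ValueError(f"expected START-END, got {text!r}")
    start, end = int(start_text), int(end_text)
    # Assumption: reject non-positive or reversed ranges.
    if start < 1 or end < start:
        raise ValueError(f"invalid row range: {text!r}")
    return start, end


def select_rows(data_rows, row_range):
    """Return (csv_row_number, row) pairs for data rows inside the range.

    `data_rows` excludes the header, so data_rows[0] is CSV row 2.
    """
    start, end = row_range
    return [
        (number, row)
        for number, row in enumerate(data_rows, start=2)
        if start <= number <= end
    ]
```

Numbering from 2 keeps the reported row numbers identical to the ones a text editor shows for the CSV file, which matches the `Row N` entries in the logs above.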
176+
177+
#### 3. PR Status Query
178+
179+
There are three status mappings defined in this script:
180+
181+
- `state == "open"` → "Opened"
182+
- `state == "closed" && merged == true` → "Accepted"
183+
- `state == "closed" && merged == false` → "Unknown"
184+
185+
A pull request can be closed without being merged for various reasons. For example, it may be marked as *DeveloperFixed*, *Rejected*, or fall into other flaky test statuses defined in IDoFT. In some cases, the changes are actually merged through an alternative workflow. Since these situations cannot be reliably distinguished automatically, such pull requests are classified as unknown and logged to `manual-check.log` for further inspection.
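
The mapping can be sketched as a pure function over the `state` and `merged` fields that the GitHub REST API returns for a pull request. The names `api_url_for` and `map_pr_status` are hypothetical, and treating any unexpected state as "Unknown" is an assumption; the actual script may structure this differently.

```python
import re

# Matches PR URLs such as https://github.com/owner/repo/pull/123
PR_URL = re.compile(r"^https://github\.com/([^/]+)/([^/]+)/pull/(\d+)/?$")


def api_url_for(pr_url):
    """Translate a PR web URL into its REST API URL, or None if it is not a PR URL."""
    match = PR_URL.match(pr_url)
    if match is None:
        return None
    owner, repo, number = match.groups()
    return f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}"


def map_pr_status(state, merged):
    """Map the API fields `state` and `merged` to a dataset status."""
    if state == "open":
        return "Opened"
    if state == "closed" and merged:
        return "Accepted"
    # Closed without merge (or anything unexpected) needs a manual check.
    return "Unknown"
```

Keeping the mapping separate from the HTTP request makes it easy to test without network access or a token.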
186+
187+
##### Output File Description
188+
189+
| File | Description |
190+
| --------------------------- | ----------------------------------------------------------- |
191+
| `accepted.log` | Records successfully updated to Accepted |
192+
| `manual-check.log` | Records requiring manual check (Unknown or other anomalies) |
193+
| `pr-status-update.log` | Complete runtime log |
194+
| `../../{pr,gr,py}-data.csv` | Updated data files |
195+
196+
---
197+
198+
#### 4. Ignore List Support
199+
200+
To exclude specific tests, create an `ignore.csv` file under `idoft/auto-update-dataset`. Ignored Python and Java tests can coexist in the same file.
201+
202+
* Example `ignore.csv`
203+
204+
```csv
205+
name
206+
tk.mybatis.mapper.mapperhelper.FieldHelperTest.testUser
207+
tests/test_converter.py::TestConverter::test_to_idna_multiple_urls
208+
```
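
Loading such a file can be sketched with the standard `csv` module. The function names `load_ignore_names` and `is_ignored` are hypothetical, and exact matching on the full test name is an assumption, although it is consistent with the examples below, where `test_to_idna_multiple_urls` is skipped while `test_to_idna_multiple` is still processed.

```python
import csv
import io


def load_ignore_names(csv_text):
    """Return the set of test names listed under the `name` column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["name"].strip() for row in reader if row.get("name")}


def is_ignored(test_name, ignore_names):
    """Assumption: a test is ignored only on an exact full-name match."""
    return test_name in ignore_names
```

A set gives constant-time lookups, which matters when every CSV row is checked against the list.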
209+
210+
##### 4.1 Example for Java
211+
212+
Assume that the 4th and 5th lines of `pr-data.csv` are as follows:
213+
214+
```csv
215+
https://github.com/abel533/Mapper,1764748eedb2f320a0d1c43cb4f928c4ccb1f2f5,core,tk.mybatis.mapper.mapperhelper.FieldHelperTest.testComplex,ID,Accepted,https://github.com/abel533/Mapper/pull/896,Accepted in the PR https://github.com/abel533/Mapper/pull/666 but later reverted in the commit https://github.com/abel533/Mapper/commit/79d313a7ca6cba6c5d5323746fb83ed5744180a1
216+
https://github.com/abel533/Mapper,1764748eedb2f320a0d1c43cb4f928c4ccb1f2f5,core,tk.mybatis.mapper.mapperhelper.FieldHelperTest.testUser,ID,Opened,https://github.com/abel533/Mapper/pull/896,Accepted in the PR https://github.com/abel533/Mapper/pull/666 but later reverted in the commit https://github.com/abel533/Mapper/commit/79d313a7ca6cba6c5d5323746fb83ed5744180a1
217+
```
218+
219+
The output should match the following. The test on row 5 is not processed because it is listed in `ignore.csv`.
220+
221+
```bash
222+
python prStatusCheck.py --prrange 4-5
223+
2025-12-09 21:07:20,243 - INFO - Loading ignore list from $HOME/idoft/auto-update-dataset/ignore.csv
224+
2025-12-09 21:07:20,245 - INFO - --- Processing pr-data.csv ---
225+
2025-12-09 21:07:20,245 - INFO - Loading data from local file: $HOME/idoft/pr-data.csv
226+
2025-12-09 21:07:20,255 - INFO - Processing CSV rows 4-5
227+
2025-12-09 21:07:20,256 - INFO - Queued 1 tasks for pr-data.csv.
228+
2025-12-09 21:07:20,776 - INFO - [pr-data.csv] Row 4: Status changed Opened -> Accepted (https://github.com/abel533/Mapper/pull/896)
229+
2025-12-09 21:07:20,777 - INFO - summary for pr-data.csv: 1 statuses updated, 0 changed but need manual check, 0 still open.
230+
2025-12-09 21:07:20,796 - INFO - Accepted log updated for pr-data.csv
231+
```
232+
233+
##### 4.2 Example for Python
234+
235+
Assume that rows 1022 through 1024 of `py-data.csv` are as follows:
236+
237+
```csv
238+
https://github.com/PyFunceble/domain2idna,39a1c4e1ebb877ed511e53b618fbe437a685c970,tests/test_converter.py::TestConverter::test_to_idna_multiple,OD-Vic,Opened,https://github.com/PyFunceble/domain2idna/pull/4,
239+
https://github.com/PyFunceble/domain2idna,39a1c4e1ebb877ed511e53b618fbe437a685c970,tests/test_converter.py::TestConverter::test_to_idna_multiple_urls,OD-Vic,Opened,https://github.com/PyFunceble/domain2idna/pull/4,
240+
https://github.com/PyFunceble/domain2idna,39a1c4e1ebb877ed511e53b618fbe437a685c970,tests/test_converter.py::TestConverter::test_to_idna_single,OD-Vic,Opened,https://github.com/PyFunceble/domain2idna/pull/4,
241+
```
242+
243+
The output should match the following. The test on row 1023 is not processed because it is listed in `ignore.csv`.
244+
245+
```bash
246+
python prStatusCheck.py --pyrange 1022-1024
247+
2025-12-09 21:59:08,624 - INFO - Loading ignore list from $HOME/idoft/auto-update-dataset/ignore.csv
248+
2025-12-09 21:59:08,626 - INFO - --- Processing py-data.csv ---
249+
2025-12-09 21:59:08,626 - INFO - Loading data from local file: $HOME/idoft/py-data.csv
250+
2025-12-09 21:59:08,629 - INFO - Processing CSV rows 1022-1024
251+
2025-12-09 21:59:08,630 - INFO - Queued 2 tasks for py-data.csv.
252+
2025-12-09 21:59:09,000 - INFO - [py-data.csv] Row 1022: Status changed Opened -> Accepted (https://github.com/PyFunceble/domain2idna/pull/4)
253+
2025-12-09 21:59:09,018 - INFO - [py-data.csv] Row 1024: Status changed Opened -> Accepted (https://github.com/PyFunceble/domain2idna/pull/4)
254+
2025-12-09 21:59:09,018 - INFO - summary for py-data.csv: 2 statuses updated, 0 changed but need manual check, 0 still open.
255+
2025-12-09 21:59:09,023 - INFO - Accepted log updated for py-data.csv
256+
```
257+
258+
#### 5. Data Source
259+
260+
- Prefers local CSV files if they exist (`../../{filename}` relative to script, i.e., `~/idoft/*.csv`)
261+
- Falls back to the GitHub remote repository if the local file is not found
262+
- Writes updates to local files
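
The local-first lookup can be sketched as follows. The function name `resolve_source` is hypothetical, and `REMOTE_BASE` is a placeholder: the actual remote URL used by the script is not stated here.

```python
from pathlib import Path

# Assumption: placeholder base URL, not the real remote location.
REMOTE_BASE = "https://example.invalid/idoft"


def resolve_source(script_dir, filename, remote_base=REMOTE_BASE):
    """Prefer ../../<filename> relative to the script; otherwise point at the remote copy."""
    local_path = Path(script_dir) / ".." / ".." / filename
    if local_path.is_file():
        return "local", str(local_path.resolve())
    return "remote", f"{remote_base}/{filename}"
```

Returning the kind of source alongside the location lets the caller log which one was used, as in the `Loading data from local file` lines above.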
263+
264+
---
