In this experiment I'm investigating different scenarios that can be solved for Github Issues using machine learning algorithms. I will be using issues for phpmyadmin/phpmyadmin. It has around 268 open issue and 10,768 closed issues - i.e. total 11K issues raised till date. The data was mined using a quick hack I wrote an year back - mebjas/gils. It creates a dump of serialized JSON objects per line indicating one issue.
All preprocessing was done using ./preprocess.py. Around 13730 rows were extracted. And some of them were duplicates, so upon deduplication 279 entires were removed leaving 13450 unique issues. The output of the data is stored in ./data/data.csv
Workbook.xlsx has some of the analysis in excel sheet.
| field_name | remarks |
|---|---|
| repo_id | Same for all rows. No impact on any decision |
| repo_name | Same for all rows. No impact on any decision |
| id | Unique ID of an issues |
| title | title of the issue (string) |
| body | body of the issue (text) |
| created_by | Github Handle of the user who created the issue. 962 unique values. |
| created_at | time string |
| updated_at | time string |
| closed_at | time string, empty for open issues |
| state | State of the issue (open |
| n_comments | no of comments on the issue (int) |
| locked | |
| milestone | NA for all values, Serialized JSON with milestone information |
| labels | labels set on a issue - label text is kept as space separated string |
| field_name | remarks |
|---|---|
| actual_author | Around 9507 rows were created by Github User pma-import. This is a bot which imports issues from sourceforge PMA page. It follows a standard template of body, from which actual author name {sourceforge ID} can be mined. |
After mining actual author information from issue body, the distribution changes. 4318 unqieu authors were identified. However 1294 of them were created by *anonymous which is apparently introduced by sourceforge. 11 authors seem to have created >100 issues. They are:
| author | issues created | Is Code Contributor? |
|---|---|---|
| nijel | 507 | true |
| lem9 | 432 | true |
| madhuracj | 364 | true |
| ibennetch | 319 | true |
| OlafvdSpek | 272 | |
| devenbansod | 139 | |
| ryandesign | 127 | |
| tithugues | 117 | |
| adamgsoc2013 | 114 | |
| windkiel | 110 | |
| xmujay | 106 | |
| roccivic | 104 |
So these folks would be our person of interest.