This document provides a comprehensive workflow for agents to collect more instances and create datasets similar to multi_swe_bench/collect/example.jsonl.
Multi-SWE-bench is a multilingual benchmark for evaluating LLMs in real-world code issue resolution. The data collection process involves finding GitHub repositories, extracting pull requests that resolve issues, and creating structured datasets with code patches and test results.
# Install dependencies
pip install -r requirements.txt
# Ensure Docker is installed for evaluation
docker --version- Obtain GitHub Personal Access Tokens for API access
- Store tokens in a file or provide as command-line arguments
- Multiple tokens recommended for higher rate limits
jqfor JSON processinggitfor repository operationsdockerfor containerized evaluation
Use crawl_repos.py to discover popular repositories in specific programming languages:
python multi_swe_bench/collect/crawl_repos.py \
--language java \
--min_stars 1000 \
--max_results 500 \
--output csv \
--token <your_github_token> \
--output_dir ./repo_listsParameters:
--language: Programming language (java, typescript, javascript, go, rust, c, cpp)--min_stars: Minimum star count for repositories--max_results: Maximum number of repositories to fetch--output: Output format (csv or console)--token: GitHub API token--output_dir: Directory to save CSV files
Output: CSV file with repository names, stars, forks, descriptions, and URLs.
Create a CSV file with repository names in the format:
Name
org1/repo1
org2/repo2For processing individual repositories, use get_pipeline.py:
python multi_swe_bench/collect/get_pipeline.py \
--out_dir ./output \
--tokens <token1> <token2> \
--org alibaba \
--repo fastjson2 \
--delay-on-error 300 \
--retry-attempts 3 \
--skip-commit-message falseParameters:
--out_dir: Output directory for processed data--tokens: GitHub API tokens (multiple tokens for better rate limits)--org: GitHub organization name--repo: Repository name--delay-on-error: Delay in seconds before retrying on error--retry-attempts: Number of retry attempts--skip-commit-message: Skip commit message processing
Pipeline Steps:
- Get All PRs: Fetches all pull requests from the repository
- Filter PRs: Filters for closed PRs that resolve issues
- Get Related Issues: Retrieves issues referenced by the PRs
- Merge Data: Combines PR and issue information
- Build Dataset: Downloads patches and creates final dataset
For processing multiple repositories from a CSV file, use get_from_repos_pipeline.py:
python multi_swe_bench/collect/get_from_repos_pipeline.py \
--csv_file ./repo_lists/github_java_repos_20250725_120000.csv \
--out_dir ./batch_output \
--max_workers 4 \
--distribute round \
--tokens <token1> <token2> <token3> \
--delay-on-error 300 \
--retry-attempts 3Parameters:
--csv_file: Path to CSV file with repository names--out_dir: Base output directory--max_workers: Maximum concurrent processes--distribute: Distribution strategy (round or chunk)--tokens: Multiple GitHub API tokens--delay-on-error: Error retry delay--retry-attempts: Number of retry attempts
Distribution Strategies:
round: Round-robin distribution of repositories among tokenschunk: Equal chunks distribution among tokens
After processing, validate the collected data:
# Verify output directory structure
ls -la ./output/org__repo/
# Expected files:
# - org__repo_prs.jsonl (all PRs)
# - org__repo_filtered_prs.jsonl (filtered PRs)
# - org__repo_filtered_prs_with_issues.jsonl (PRs with issues)
# - org__repo_dataset.jsonl (final dataset)# Check final dataset format
head -1 ./output/org__repo/org__repo_dataset.jsonl | jq .
# Verify required fields:
# - org, repo, number, state, title, body
# - base (label, ref, sha)
# - resolved_issues (array with number, title, body)
# - fix_patch, test_patch
# - fixed_tests, failed_tests, skipped_tests
# - instance_id# Count total instances
wc -l ./output/org__repo/org__repo_dataset.jsonl
# Check for instances with both fix and test patches
jq 'select(.fix_patch != null and .test_patch != null)' ./output/org__repo/org__repo_dataset.jsonl | wc -l
# Verify test results exist
jq 'select(.fixed_tests != null)' ./output/org__repo/org__repo_dataset.jsonl | wc -lCombine datasets from multiple repositories:
# Create consolidated dataset
mkdir -p ./final_dataset
# Combine all datasets
find ./batch_output -name "*_dataset.jsonl" -exec cat {} \; > ./final_dataset/multi_swe_bench_new.jsonl
# Count total instances
wc -l ./final_dataset/multi_swe_bench_new.jsonl
# Validate consolidated dataset
head -5 ./final_dataset/multi_swe_bench_new.jsonl | jq .Each instance in the final dataset should contain:
{
"org": "string", // GitHub organization
"repo": "string", // Repository name
"number": "integer", // PR number
"state": "string", // PR state (should be "closed")
"title": "string", // PR title
"body": "string", // PR description
"base": { // Base branch information
"label": "string", // Branch label
"ref": "string", // Branch reference
"sha": "string" // Commit SHA
},
"resolved_issues": [ // Array of resolved issues
{
"number": "integer", // Issue number
"title": "string", // Issue title
"body": "string" // Issue description
}
],
"fix_patch": "string", // Git diff for the fix
"test_patch": "string", // Git diff for tests
"fixed_tests": { // Test results after fix
"test_class_name": {
"run": "PASS|FAIL",
"test": "PASS|FAIL|NONE",
"fix": "PASS|FAIL"
}
},
"failed_tests": ["string"], // List of failed test names
"skipped_tests": ["string"], // List of skipped test names
"instance_id": "string" // Unique identifier (org__repo-number)
}- Focus on active repositories with recent commits
- Prefer repositories with good test coverage
- Select repositories from different domains/use cases
- Consider language-specific popular frameworks and libraries
- Use multiple GitHub tokens to increase rate limits
- Implement proper delays between requests
- Monitor rate limit headers in API responses
- Use exponential backoff for retries
- Log all errors with context information
- Implement retry mechanisms for transient failures
- Skip repositories that consistently fail
- Validate data at each pipeline step
- Filter for PRs that actually resolve issues
- Ensure patches contain meaningful code changes
- Verify test patches exist and are relevant
- Remove duplicate instances across repositories
- Use consistent directory naming conventions
- Keep intermediate files for debugging
- Implement checkpointing for long-running processes
- Compress large datasets for storage efficiency
# Error: API rate limit exceeded
# Solution: Add more tokens or increase delays
--tokens token1 token2 token3 --delay-on-error 600# Error: fix_patch or test_patch is null
# Check: PR actually contains code changes
jq 'select(.fix_patch == null)' dataset.jsonl# Error: fixed_tests is empty
# Check: Repository has proper test infrastructure
# Verify: Docker environment can run tests# Error: Out of memory during processing
# Solution: Process repositories in smaller batches
--max_workers 2 # Reduce concurrent processes# Check API token validity
curl -H "Authorization: token <your_token>" https://api.github.com/rate_limit
# Validate JSON format
jq empty dataset.jsonl # Should produce no output if valid
# Check for duplicate instances
jq -r '.instance_id' dataset.jsonl | sort | uniq -d
# Analyze patch sizes
jq -r '.fix_patch | length' dataset.jsonl | sort -n | tail -10- Use multiple workers for batch processing
- Distribute repositories across different tokens
- Process different languages simultaneously
- Cache API responses to avoid redundant requests
- Store intermediate results for resumability
- Use local git clones for patch extraction
- Apply early filtering to reduce processing overhead
- Skip repositories with insufficient activity
- Filter out PRs without proper issue references
Before finalizing the dataset, ensure:
- All instances have required fields
- Instance IDs are unique across the dataset
- Fix patches contain actual code changes
- Test patches are present and relevant
- Test results are properly formatted
- No sensitive information is included
- Dataset size is reasonable for intended use
- Data distribution across languages is balanced
After collecting the dataset:
- Evaluation Setup: Prepare Docker environments for testing
- Baseline Testing: Run existing models on the new dataset
- Quality Assessment: Manual review of sample instances
- Documentation: Update dataset documentation and metadata
- Publication: Share dataset with the community
- GitHub Issues: Report bugs and request features
- Documentation: Refer to existing Multi-SWE-bench documentation
- Community: Join Discord for discussions and support
- Examples: Study existing datasets for reference patterns
This workflow provides a comprehensive guide for collecting high-quality Multi-SWE-bench instances. Follow the steps systematically and validate data quality at each stage to ensure the resulting dataset meets the benchmark standards.