This repository contains the submissions for the NYU CTF Dataset leaderboard.
- Fork this repository
- Clone your forked repository - use `git clone --depth 1` to avoid pulling the entire history, as it can be large
- Create a folder under `transcripts/` with a unique, descriptive name (see the example layout below). The folder should contain the following:
  - A `summary.json` with the success result of each challenge and submission metadata (see structure below)
  - The agent conversation transcripts or logs (see details below)
  - A README that describes your submission, the format of your log files, and provides a point of contact
- Each folder under `transcripts/` containing a valid `summary.json` is considered a submission
- After creating your folder, generate the `leaderboard.json` (described below) to verify that your submission is processed correctly
- Finally, create a PR to the main repository with your submission

Note: DO NOT add the generated `leaderboard.json` to your PR.
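As an illustration, a submission folder might look like this (all names besides `summary.json` are hypothetical):

```
transcripts/
└── my-agent-gpt4-pass5/        # unique descriptive folder name
    ├── summary.json            # results and metadata (structure below)
    ├── README.md               # submission description, log format, contact
    └── logs/
        ├── 2023q-pwn-puffin.json
        └── ...
```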
The `summary.json` should follow this structure:

```json
{
    "metadata": {
        "agent": ...,
        "comment": ...,
        "model": ...,
        "link": ...,
        "date": ...
    },
    "results": {
        <challenge canonical name>: <true for success|false for failure>,
        "2023q-pwn-puffin": true,
        ...
    }
}
```
The `metadata` object should contain the following fields:
- `agent`: The agent name
- `comment`: A short comment to describe the results, e.g. "pass@5" (leave empty if not needed)
- `model`: Exact model string with date stamp, e.g. `gpt-4-0125-preview`
- `link`: Link to the agent repository or documentation
- `date`: Date of submission in "YYYY/MM/DD" format
The `results` map should contain the success or failure of each challenge in the dataset.
The challenge canonical name can be generated with the `nyuctf` package using `CTFChallenge.canonical_name`.
A challenge is marked as a success when the agent submits the correct flag or the flag appears in one of the agent's outputs.
Otherwise, it is marked as a failure; missing or errored runs are also marked as failures.
There should be an entry for all 200 challenges of the dataset.
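As a concrete illustration, here is a minimal sketch that builds the results map directly from the dataset file. It assumes the dataset JSON is keyed by canonical challenge name and that you track the set of challenges your agent solved; the metadata values and the `solved` set are placeholders:

```python
import json
from pathlib import Path

# Dataset path used by the leaderboard generator (see the command below)
DATASET = Path.home() / ".nyuctf/v20241008/test_dataset.json"

# Hypothetical: canonical names of the challenges your agent solved
solved = {"2023q-pwn-puffin"}

# Assumption: the dataset JSON maps canonical names to challenge entries
dataset = json.loads(DATASET.read_text())

summary = {
    "metadata": {
        "agent": "my-agent",                # placeholder values
        "comment": "pass@5",
        "model": "gpt-4-0125-preview",
        "link": "https://github.com/example/my-agent",
        "date": "2024/10/08",
    },
    # One entry per challenge; unsolved, missing, or errored runs stay false
    "results": {name: name in solved for name in dataset},
}

Path("summary.json").write_text(json.dumps(summary, indent=2))
```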
You are free to choose the file format and structure of the logs, but they must contain at least the following information:
- The conversation history, including the initial prompt, the LLM's outputs, and the commands executed along with their output
- A timestamp of when the transcript was generated
- An indicator of whether the correct flag was found
Please describe the transcript format in the README of your submission. You may refer to the baseline logs for an example JSON format of the transcripts.
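For instance, a transcript covering those three points could look like the following (this layout is purely illustrative, not the baseline format):

```json
{
    "challenge": "2023q-pwn-puffin",
    "timestamp": "2024-10-08T14:32:11Z",
    "flag_found": true,
    "conversation": [
        {"role": "system", "content": "<initial prompt>"},
        {"role": "assistant", "content": "Let me inspect the binary first."},
        {"role": "command", "input": "file puffin", "output": "puffin: ELF 64-bit ..."}
    ]
}
```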
`leaderboard.json` is the file that accumulates all leaderboard submissions and is loaded by the leaderboard webpage.
Run the `generate_leaderboard.py` script to generate it:

```sh
python3 generate_leaderboard.py --dataset ~/.nyuctf/v20241008/test_dataset.json
```
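Before opening the PR, it may also be worth checking that your `summary.json` covers the dataset exactly. A small sketch, assuming the same dataset path as above and a hypothetical submission folder name:

```python
import json
from pathlib import Path

dataset = json.loads((Path.home() / ".nyuctf/v20241008/test_dataset.json").read_text())
summary = json.loads(Path("transcripts/my-agent-gpt4-pass5/summary.json").read_text())

results = summary["results"]
missing = sorted(set(dataset) - set(results))
unknown = sorted(set(results) - set(dataset))
assert not missing, f"Missing entries: {missing}"
assert not unknown, f"Unknown challenges: {unknown}"
print(f"OK: {sum(results.values())}/{len(results)} challenges solved")
```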