HALLMARK

HALLucination benchMARK: A benchmark for evaluating citation hallucination detection tools.

Why HALLMARK?

The NeurIPS 2025 incident, in which 53 papers were found to contain fabricated citations that passed peer review, exposed a critical gap: we have no standardized way to measure how well tools detect citation hallucinations. HALLMARK fills this gap.

HALLMARK draws on best practices from established benchmarks:

HumanEval: Multi-criteria sub-tests per entry (~6 checks per citation)
SWE-bench: Contamination awareness via temporal segmentation
LiveCodeBench: Continuous updates and post-cutoff evaluation
ONEBench: Sample-level atomic evaluation with ever-expanding pool

Features

Hallucination taxonomy: 14 types across 3 difficulty tiers (Easy / Medium / Hard)
2,525 annotated entries: 773 valid (from DBLP) + 1,177 hallucinated with ground truth (public splits)
6 sub-tests per entry: DOI resolution, title matching, author consistency, venue verification, field completeness, cross-database agreement
Evaluation metrics: Detection Rate, F1, tier-weighted F1, detect@k, ECE
Built-in baselines: DOI-only, bibtex-updater, HaRC, verify-citations, LLM-based (OpenAI, Anthropic, OpenRouter), agentic LLMs with tool use, ensemble, and a DB-first cascade with hallucination-mode diagnosis. Also included are ports of two recent papers: hallucitechecker (Sakai et al. 2026) and checkifexist (Abbonato 2026, Algorithm 1). CiteVerifier and hallucinator are available as wrapper modules but are not registered in the default registry.
Baseline registry: Central discovery, availability checking, and dispatch for all baselines (19+ variants)
Reproducible runs: the opt-in --cache-path flag wraps HTTP calls in a SQLite-backed requests-cache so that re-runs reuse frozen API responses; --timing-breakdown and --subtask-diagnostic surface per-baseline performance and a recognition/matching/calibration decomposition
Plackett-Luce ranking: ONEBench-inspired ranking that handles incomplete evaluation data
Automated execution: Orchestrator script and CI workflow for batch baseline evaluation
Temporal analysis: Contamination detection via pre/post-cutoff comparison
Community contributions: ONEBench-style ever-expanding sample pool

Headline cascade results (v1.1)

cascade_db_diagnosis — Stage 1 bibtex-updater + Stage 2 Claude Sonnet 4.6 (via OpenRouter, up to 5 tool calls), conservative vs aggressive scoring of residual UNCERTAIN:

Split	Mode	DR	FPR	F1	Tier-3 F1	AUROC
`dev_public`	conservative	0.976	0.559	0.760	0.417	0.833
`dev_public`	aggressive	0.983	0.560	0.815	0.570	0.740
`test_public`	conservative	0.972	0.456	0.854	0.596	0.867
`test_public`	aggressive	0.978	0.456	0.882	0.707	0.805
`stress_test`	conservative	0.969	—	0.985	0.983	—
`stress_test`	aggressive	0.975	—	0.987	0.986	—

Aggressive promotion of residual UNCERTAIN (the "DB-as-gold-standard" stance) lifts Tier-3 F1 by +11.1 pp on test_public and +15.3 pp on dev_public at ≤0.1 pp FPR cost; the trade is paid in rank-discrimination (AUROC −6.2 / −9.3 pp). Runner-level (cascade_db_diagnosis_aggressive) and evaluator-level (--eval-mode aggressive) promotion paths agree to within ~1 pp on every metric. Full JSONs (incl. per-tier/per-type breakdowns) in data/v1.0/baseline_results/; see paper §Stage-2 diagnosis cascade for analysis.
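
The deltas quoted above can be re-derived from the table. The following is a minimal sketch with the table values copied by hand into a dictionary; the names `results` and `delta_pp` are illustrative and not part of the package.

```python
# (Tier-3 F1, FPR, AUROC) per split and mode, copied from the headline table.
results = {
    ("dev_public", "conservative"): (0.417, 0.559, 0.833),
    ("dev_public", "aggressive"): (0.570, 0.560, 0.740),
    ("test_public", "conservative"): (0.596, 0.456, 0.867),
    ("test_public", "aggressive"): (0.707, 0.456, 0.805),
}


def delta_pp(split: str) -> tuple:
    """Aggressive minus conservative, in percentage points, rounded to 0.1."""
    cons = results[(split, "conservative")]
    aggr = results[(split, "aggressive")]
    return tuple(round((a - c) * 100, 1) for a, c in zip(aggr, cons))


for split in ("test_public", "dev_public"):
    tier3_f1, fpr, auroc = delta_pp(split)
    print(f"{split}: Tier-3 F1 {tier3_f1:+.1f} pp, FPR {fpr:+.1f} pp, AUROC {auroc:+.1f} pp")
```

stress_test is left out because FPR and AUROC are not reported for that split.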

Installation

# Recommended: clone and install in development mode
git clone https://github.com/rpatrik96/hallmark.git
cd hallmark
uv pip install -e ".[dev]"

# With LLM baseline SDKs (openai, anthropic)
uv pip install -e ".[baselines]"

# With ranking support (Plackett-Luce model via choix)
uv pip install -e ".[ranking]"

# All optional dependencies
uv pip install -e ".[all]"

Note: HALLMARK is not yet published on PyPI, so pip install hallmark will not work. Use the clone + install path above.

Baseline Installation Guide

The [baselines] extra installs only the LLM SDKs (openai, anthropic). External CLI tools require separate installation due to a bibtexparser 1.x dependency conflict:

# HaRC
pipx install harcx

# bibtex-updater (released HALLMARK numbers use tag v1.2.0)
pipx install "bibtex-updater==1.2.0"

# verify-citations
pipx install verify-citations

# CiteVerifier (GhostCite) — clone required
git clone https://github.com/NKU-AOSP-Lab/CiteVerifier

# hallucinator — clone required
git clone https://github.com/gianlucasb/hallucinator

Using pipx isolates each tool's bibtexparser 1.x from your project environment.

Quick Start

Evaluate a built-in baseline

# Run DOI-only baseline on the dev split
hallmark evaluate --split dev_public --baseline doi_only

# Run the v1.1 cascade with aggressive scoring (DB as gold standard).
# Stage 2 LLM diagnoser is routed through OpenRouter — set OPENROUTER_API_KEY.
hallmark evaluate --split dev_public --baseline cascade_db_diagnosis_aggressive \
    --stage2-baseline llm_agentic_openrouter_claude_sonnet_4_6

# Re-score the same predictions under both eval modes (conservative + aggressive)
# in a single payload — the gap quantifies the abstention/indexing-lag tax.
hallmark evaluate --split dev_public --baseline cascade_db_diagnosis \
    --stage2-baseline llm_agentic_openrouter_claude_sonnet_4_6 \
    --eval-mode both

# Run with custom predictions
hallmark evaluate --split dev_public --predictions my_predictions.jsonl --tool-name my-tool

Show dataset statistics

hallmark stats --split dev_public

Run all baselines at once

# Run all free baselines and generate leaderboard
python scripts/run_all_baselines.py --split dev_public --output-dir results/

# Run specific baselines in parallel
python scripts/run_all_baselines.py --baselines doi_only,bibtexupdater --parallel

# Run only free (no API key) baselines, skip unavailable
python scripts/run_all_baselines.py --baselines free --skip-unavailable

Resume long-running LLM evaluations in parallel

For LLM-based baselines that take >1 hour sequentially, use the parallel-resume scripts to checkpoint and resume:

# Resume zero-shot OpenRouter LLM baselines across multiple processes
python scripts/parallel_resume_test_public.py --split test_public --num-workers 4

# Resume agentic verifiers (BTU, multi-tool, tool-augmented) with Sonnet 4.6
python scripts/parallel_agentic_btu_test_public.py --split test_public --verifier agentic_btu_openai

Both scripts support checkpointing and can safely resume interrupted runs without recomputing completed entries.

View the leaderboard

hallmark leaderboard --results-dir results/

See examples/ for full walkthroughs, including writing a custom baseline and per-type analysis.

Evaluate Your Tool

To evaluate any external tool against HALLMARK, produce a JSONL file with one prediction per line and run:

hallmark evaluate --predictions my_preds.jsonl --split dev_public

Each prediction must include:

{
  "bibtex_key": "a3f9c2b1...",
  "label": "HALLUCINATED",
  "confidence": 0.87,
  "reason": "DOI does not resolve",
  "subtest_results": {"doi_resolves": false},
  "api_sources_queried": ["crossref"],
  "wall_clock_seconds": 1.2,
  "api_calls": 1
}

bibtex_key format: Keys in the benchmark are hex hashes (e.g., a3f9c2b1d4e7...), not human-readable keys like vaswani2017attention. Your predictions must use the exact keys from the loaded entries — use entry.bibtex_key when iterating over load_split() results.

See examples/03_custom_baseline.py for a complete end-to-end example.

Prediction Fields

Field	Required	Affects
`bibtex_key`	Yes	Entry matching
`label`	Yes	All metrics
`confidence`	Yes	ECE, AUROC, AUPRC
`reason`	No	Diagnose output
`subtest_results`	No	Subtest accuracy
`api_sources_queried`	No	Source-stratified metrics
`wall_clock_seconds`	No	Cost efficiency
`api_calls`	No	Mean API calls
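
As a rough illustration of the format, the sketch below writes predictions to a JSONL file using only the standard library. The helper `write_predictions` is hypothetical (it is not part of the hallmark package); it only enforces the three fields marked as required in the table above.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {"bibtex_key", "label", "confidence"}


def write_predictions(path: str, predictions: list) -> int:
    """Write one JSON object per line; return the number of lines written."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for pred in predictions:
            missing = REQUIRED_FIELDS - pred.keys()
            if missing:
                raise ValueError(f"missing required fields: {sorted(missing)}")
            fh.write(json.dumps(pred) + "\n")
    return len(predictions)


preds = [
    {
        "bibtex_key": "a3f9c2b1d4e76f85",  # must be the exact key of the loaded entry
        "label": "HALLUCINATED",
        "confidence": 0.87,
        "reason": "DOI does not resolve",
        "subtest_results": {"doi_resolves": False},
        "api_sources_queried": ["crossref"],
        "wall_clock_seconds": 1.2,
        "api_calls": 1,
    }
]
n_written = write_predictions("my_preds.jsonl", preds)
```

The resulting file can then be passed to hallmark evaluate --predictions my_preds.jsonl --split dev_public.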

UNCERTAIN label: UNCERTAIN is accepted as a prediction label. UNCERTAIN predictions are treated as VALID for confusion-matrix metrics (conservative default) and excluded from AUROC/AUPRC. Prefer VALID or HALLUCINATED with calibrated confidence when possible.

Confidence semantics: confidence = P(your predicted label is correct). If you predict HALLUCINATED with 0.9, you claim 90% certainty it is hallucinated. If you predict VALID with 0.8, you claim 80% certainty it is valid. This is NOT P(HALLUCINATED).
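
Many detectors natively output a single score P(HALLUCINATED). A minimal sketch of converting such a score into the (label, confidence) pair described above follows; the helper name and the 0.5 threshold are assumptions for illustration, not part of the package.

```python
def to_prediction_fields(p_hallucinated: float, threshold: float = 0.5) -> tuple:
    """Map a P(HALLUCINATED) score to (label, confidence in that label)."""
    if not 0.0 <= p_hallucinated <= 1.0:
        raise ValueError("p_hallucinated must lie in [0, 1]")
    if p_hallucinated >= threshold:
        return "HALLUCINATED", p_hallucinated
    # Predicting VALID: the confidence is P(VALID) = 1 - P(HALLUCINATED).
    return "VALID", 1.0 - p_hallucinated


flagged = to_prediction_fields(0.9)  # label HALLUCINATED, confidence 0.9
cleared = to_prediction_fields(0.2)  # label VALID, confidence about 0.8
```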

Hallucination Taxonomy

Tier 1: Easy (detectable by simple API lookup)

Type	Description	Example
`fabricated_doi`	DOI that doesn't resolve	`doi = {10.9999/fake.2024.001}`
`nonexistent_venue`	Invented journal/conference	`booktitle = {Intl. Conf. on Advanced AI Systems}`
`placeholder_authors`	Generic/fake author names	`author = {John Doe and Jane Smith}`
`future_date`	Publication year in the future	`year = {2030}`

Tier 2: Medium (requires cross-referencing metadata)

Type	Description	Example
`chimeric_title`	Real author + fabricated title	Real authors, plausible but non-existent paper
`wrong_venue`	Real paper, wrong venue/year	Correct title but at ICML not NeurIPS
`author_mismatch`	Author list swapped or fabricated (data value: `swapped_authors`)	Correct title, wrong author list
`preprint_as_published`	arXiv paper cited as venue paper	Correct paper, fabricated venue acceptance
`hybrid_fabrication`	Real DOI + fabricated metadata	Valid DOI resolves but authors/title don't match
`merged_citation`	Metadata from 2-3 papers merged	Authors from paper A, title from paper B
`partial_author_list`	Subset of real author list	First and last author only, middle dropped

Tier 3: Hard (requires deep verification)

Type	Description	Example
`near_miss_title`	Title off by 1-2 words	"Attention Is All You Want" vs "...Need"
`plausible_fabrication`	Entirely fabricated but realistic	Realistic author + plausible title
`arxiv_version_mismatch`	Mixed preprint/published metadata	arXiv ID with conference venue claim

Hosting & Croissant

The dataset is mirrored on HuggingFace (parquet + jsonl + baseline results + RAI Croissant metadata): https://huggingface.co/datasets/hallmark-neurips2026/HALLMARK

A Croissant 1.0 metadata file is included at the repo root (croissant.json). It covers all public splits and includes RAI fields required by NeurIPS 2026 D&B. Validate locally with:

mlcroissant validate --jsonld croissant.json

The data is also shipped in data/v1.0/ for direct repo-relative access without any external download.

Dataset

Splits

Split	Valid	Hallucinated	Total	Purpose
`dev_public`	513	606	1,119	Development and tuning
`test_public`	312	519	831	Public leaderboard
`test_hidden`	—	—	454	Anti-gaming evaluation
`stress_test`	1	121	122	In-depth coverage of stress-test types

stress_test design note: The stress_test split is all-hallucinated by design. It contains challenging edge cases (merged citations, partial author lists, arXiv version mismatches) intended to stress-test detection robustness beyond the main splits. Because there are no valid entries, FPR and specificity are undefined for this split. Use detection rate as the primary metric when reporting stress_test results.

Tier distribution per split: ~27% Tier 1, ~47% Tier 2, ~26% Tier 3 (hallucinated entries).

Subtest Definitions

Subtest	Definition
`doi_resolves`	DOI returns HTTP 200 from doi.org (redirects count as resolved)
`title_exists`	Title found in Semantic Scholar or DBLP via exact or fuzzy match (threshold 0.9)
`authors_match`	Author last names match the record retrieved via DOI or title lookup
`venue_correct`	The venue/journal is correct for this specific paper (not just "a real venue")
`fields_complete`	All standard BibTeX fields for this entry type are present and non-empty
`cross_db_agreement`	Metadata from DOI resolution matches metadata from title/author search in DBLP/S2
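
To make two of these definitions concrete, here is a rough sketch of a fuzzy title check and a field-completeness check. It assumes difflib's SequenceMatcher ratio as the similarity measure and a hand-written list of required fields per entry type; the benchmark's actual matcher and field lists may differ.

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace before comparing titles."""
    return " ".join(title.lower().split())


def title_matches(candidate: str, reference: str, threshold: float = 0.9) -> bool:
    """Exact or fuzzy title match at the given similarity threshold."""
    ratio = SequenceMatcher(None, normalize(candidate), normalize(reference)).ratio()
    return ratio >= threshold


# Assumed required fields per BibTeX entry type (illustrative only).
REQUIRED_BIBTEX_FIELDS = {
    "inproceedings": ("title", "author", "year", "booktitle"),
    "article": ("title", "author", "year", "journal"),
}


def fields_complete(bibtex_type: str, fields: dict) -> bool:
    """True if every required field is present and non-empty."""
    required = REQUIRED_BIBTEX_FIELDS.get(bibtex_type, ("title", "author", "year"))
    return all(fields.get(name, "").strip() for name in required)
```

In a real pipeline, `reference` would be the title returned by a Semantic Scholar or DBLP lookup rather than a hard-coded string.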

Data Format

bibtex_key format: Keys are hex hashes (e.g., a3f9c2b1d4e7...), not human-readable keys. When writing predictions, always use entry.bibtex_key directly; do not construct keys manually.

Each entry is a JSON object in JSONL format:

{
  "bibtex_key": "a3f9c2b1d4e76f85",
  "bibtex_type": "inproceedings",
  "fields": {
    "title": "Attention Is All You Need",
    "author": "Ashish Vaswani and Noam Shazeer and ...",
    "year": "2017",
    "booktitle": "NeurIPS",
    "doi": "10.5555/3295222.3295349"
  },
  "label": "VALID",
  "hallucination_type": null,
  "difficulty_tier": null,
  "explanation": "Valid entry scraped from DBLP and verified",
  "subtests": {
    "doi_resolves": true,
    "title_exists": true,
    "authors_match": true,
    "venue_correct": true,
    "fields_complete": true,
    "cross_db_agreement": true
  }
}

Evaluation Metrics

Metric	Description
Detection Rate (DR)	Recall on hallucinated entries
False Positive Rate (FPR)	Valid entries incorrectly flagged
F1-Hallucination	Harmonic mean of precision and recall on HALLUCINATED class
Tier-weighted F1	F1 weighted by difficulty (Tier 3 = 3x weight)
ECE	Expected Calibration Error — measures confidence calibration quality
detect@k	Fraction detected using k verification strategies (deterministic and order-dependent, unlike the stochastic pass@k)
MCC	Matthews Correlation Coefficient — prevalence-invariant; use as primary metric when comparing results across splits
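
The confusion-matrix metrics and ECE can be sketched in a few lines. This is an illustrative re-implementation under stated assumptions, not the package's code: UNCERTAIN is resolved to VALID (conservative) or HALLUCINATED (aggressive) before counting, and ECE uses 10 equal-width bins, which may not match the binning HALLMARK uses. Tier-weighted F1 and detect@k are not covered.

```python
from math import sqrt


def confusion_counts(gold, pred, uncertain_as="VALID"):
    """Return (tp, fp, fn, tn), treating HALLUCINATED as the positive class."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, pred):
        if p == "UNCERTAIN":
            p = uncertain_as  # "VALID" = conservative, "HALLUCINATED" = aggressive
        if g == "HALLUCINATED":
            tp += p == "HALLUCINATED"
            fn += p != "HALLUCINATED"
        else:
            fp += p == "HALLUCINATED"
            tn += p != "HALLUCINATED"
    return tp, fp, fn, tn


def summarize(tp, fp, fn, tn):
    """Detection rate, false positive rate, F1 on HALLUCINATED, and MCC."""
    dr = tp / (tp + fn) if tp + fn else float("nan")
    fpr = fp / (fp + tn) if fp + tn else float("nan")  # undefined without valid entries
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * dr / (precision + dr) if precision + dr else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"DR": dr, "FPR": fpr, "F1": f1, "MCC": mcc}


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE; confidence = P(predicted label is correct)."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


# Toy data: 4 hallucinated and 4 valid entries, one UNCERTAIN prediction.
gold = ["HALLUCINATED"] * 4 + ["VALID"] * 4
pred = ["HALLUCINATED", "HALLUCINATED", "HALLUCINATED", "UNCERTAIN",
        "HALLUCINATED", "VALID", "VALID", "VALID"]
conservative = summarize(*confusion_counts(gold, pred, uncertain_as="VALID"))
aggressive = summarize(*confusion_counts(gold, pred, uncertain_as="HALLUCINATED"))
```

On this toy data the single UNCERTAIN prediction sits on a hallucinated entry, so the aggressive mode raises DR while leaving FPR unchanged.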

Title-Oracle Baseline (Diagnostic)

The title_oracle baseline quantifies the ceiling of a perturbation-structure shortcut present in HALLMARK's design. Because most HALLUCINATED entries are generated by perturbing real (VALID) papers, they inherit the original title. This means a title that appears as VALID in the dev split almost certainly belongs to a perturbed — hence hallucinated — entry when it reappears in another split.

The oracle exploits this directly: if a blind entry's title matches any VALID title in the dev split, it predicts HALLUCINATED.

Empirical results on v1.0 data:

~33% of unique titles appear as both VALID and HALLUCINATED across dev/test splits.
Applied to the hidden split: F1 = 0.389 at perfect precision (P = 1.0, recall = ~0.24).
Titles absent from any valid pool are 100% HALLUCINATED in the dataset.

This is not a legitimate detection method: it requires access to dev ground-truth labels as a look-up table, which constitutes label leakage when evaluating on dev itself. Report it alongside real baselines to make the shortcut visible. Any real tool whose F1 on the hidden split does not exceed the title oracle's is doing no better than this structural shortcut, so its score should not be read as evidence of genuine citation verification.

from hallmark.baselines.title_oracle import run_title_oracle
from hallmark.dataset.loader import load_split

dev_entries  = load_split("dev_public")
test_entries = load_split("test_public")
blind_test   = [e.to_blind() for e in test_entries]

predictions = run_title_oracle(blind_test, reference_pool=dev_entries)
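
The shortcut itself is simple enough to sketch without the package. The version below is a standalone illustration on toy data that assumes exact matching after lowercasing and whitespace normalization; run_title_oracle may match titles differently. It also shows why perfect precision with recall near 0.24 yields an F1 just under 0.39.

```python
def normalize(title: str) -> str:
    return " ".join(title.lower().split())


def title_oracle(blind_titles, reference_valid_titles):
    """Predict HALLUCINATED when a title matches a known VALID reference title."""
    pool = {normalize(t) for t in reference_valid_titles}
    return ["HALLUCINATED" if normalize(t) in pool else "VALID" for t in blind_titles]


def f1_score(precision: float, recall: float) -> float:
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy example: one perturbed copy of a dev title, one fabrication, one valid paper.
dev_valid_titles = ["Attention Is All You Need", "Deep Residual Learning for Image Recognition"]
blind_titles = ["attention is all you need", "A Made-Up Paper Title", "Some Other Real Paper"]
gold = ["HALLUCINATED", "HALLUCINATED", "VALID"]

pred = title_oracle(blind_titles, dev_valid_titles)
tp = sum(p == g == "HALLUCINATED" for p, g in zip(pred, gold))
precision = tp / pred.count("HALLUCINATED")
recall = tp / gold.count("HALLUCINATED")
toy_f1 = f1_score(precision, recall)

# With P = 1.0 and recall rounded to 0.24, F1 = 0.48 / 1.24.
hidden_f1_approx = f1_score(1.0, 0.24)
```

The small gap between this approximation and the reported 0.389 comes from rounding the recall to two decimals.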

Main Results (dev_public, 1,119 entries)

Twelve full-coverage tools evaluated on dev_public. All numbers reproduce Table 1 of the paper. Bold = best among independent (non-co-designed) full-coverage tools. ΔFPR is the cross-split shift test_public − dev_public; — means no test_public evaluation.

Tool	DR ↑	FPR ↓	F1 ↑	MCC ↑	TW-F1 ↑	ECE ↓	ΔFPR ↓
Citation-database tools (with shared pre-screening)
DOI-only	.256	.195	.361	.093	.314	.143	+0.108
Zero-shot LLMs (sorted by FPR)
Gemini 2.5 Pro	.456	.053	.609	.446	.587	.321	+0.011
Claude Opus 4.7	.733	.060	.824	.672	.840	.112	−0.001
Claude Sonnet 4.6	.777	.095	.840	.677	.842	.066	+0.023
Gemini 2.5 Flash	.482	.101	.617	.406	.608	.286	+0.010
Llama 4 Maverick	.591	.150	.693	.446	.688	.197	+0.028
GPT-5.4 (zero-shot)	.744	.228	.775	.512	.792	.215	−0.005
Mistral Large	.691	.258	.731	.430	.743	.247	+0.045
GPT-5.1 (zero-shot)	.823	.405	.771	.432	.818	.189	+0.076
Qwen3-235B	.832	.551	.737	.307	.806	.294	+0.080
Qwen3-VL-235B	.834	.567	.735	.294	.804	.298	+0.085
DeepSeek-R1	.871	.640	.737	.273	.814	.247	−0.310
DeepSeek-V3.2	.880	.730	.721	.191	.805	.331	+0.047
Agentic (tool-use; up to 5 tool calls per entry)
GPT-5.1 + CrossRef/OpenAlex/arXiv	.956	.465	.827	.556	.895	.165	+0.058
GPT-5.1 + bibtex-updater (tool optional)	.965	.461	.832	.574	.901	.113	−0.116
Sonnet 4.6 + bibtex-updater (tool optional)	.970	.426	.845	.610	.908	.110	−0.092
Co-designed (reference upper bound)
bibtex-updater	.946	.179	.908	.781	.936	.297	+0.159
GPT-5.1 + bibtex-updater (always-call; output in prompt)	.818	.144	.846	.670	.856	.086	+0.110

DR = Detection Rate · FPR = False Positive Rate · TW-F1 = Tier-weighted F1 · MCC = Matthews Correlation Coefficient · ECE = Expected Calibration Error. The shaded co-designed block is a reference upper bound: bibtex-updater's development overlapped with the benchmark's taxonomy design, so its scores risk construct-overfitting and should not be compared head-to-head with independent tools. HaRC and verify-citations are omitted: Semantic Scholar throttling collapses their effective coverage to <7% on dev_public.

Key Takeaways

LLMs span a wide recall–precision spectrum. From ultra-conservative (Gemini 2.5 Pro: 46% DR, 5% FPR) to aggressive (DeepSeek-V3.2: 88% DR, 73% FPR). Claude Sonnet 4.6 and Opus 4.7 jointly lead independent tools on F1/calibration (Sonnet F1 = 0.840 / ECE = 0.066), far ahead of GPT-5.1 (F1 0.771) and the recall-aggressive open-weight cohort.
Agentic lookups inflate FPR. A 5-call budget closes GPT-5.1's recall gap to bibtex-updater (DR 0.97 vs. 0.95), but agentic FPR remains ~2.6× higher (0.46 vs. 0.18) because the harness flags an entry whenever any one of CrossRef/OpenAlex/arXiv returns no match. F1 still trails by 7.6 pp. Substituting Sonnet 4.6 reproduces the GPT-5.1 profile within ≤3.5 pp on every metric — the FPR rise is harness-driven, not LLM-driven.
Base-rate precision collapse. Extrapolated to real-world hallucination rates, every evaluated setting yields roughly one true hallucination per ten flagged citations, so recall-optimized verifiers misallocate reviewer effort.
Post-cutoff calibration breakdown. On 448 papers from 2024–2025, 8 of 12 LLMs over-flag sharply (FPR up to 0.89). Sonnet 4.6 and Opus 4.7 hold FPR ≤ 0.12; GPT-5.4 (FPR 0.41) and Gemini 2.5 Pro (FPR 0.25) only partially recover.
A capability gap remains. Even the highest-recall independent model misses 12% of hallucinations, with systematic weaknesses on subtle types (near_miss_title: 56%, author_mismatch: 58% for GPT-5.1). No tool dominates across regimes: bibtex-updater is cheapest and most temporally stable; Sonnet 4.6 / Opus 4.7 lead on FPR and PPV; the rule-based F1 lead collapses on test_public.
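
The base-rate effect follows from Bayes' rule. The sketch below computes positive predictive value (PPV) from a tool's DR and FPR at a given prevalence. The 2% prevalence is a hypothetical value chosen for illustration, not a figure taken from the paper; the DR and FPR values are bibtex-updater's dev_public row from the table above.

```python
def ppv(detection_rate: float, false_positive_rate: float, prevalence: float) -> float:
    """Fraction of flagged citations that are truly hallucinated."""
    true_flags = detection_rate * prevalence
    false_flags = false_positive_rate * (1.0 - prevalence)
    return true_flags / (true_flags + false_flags)


# bibtex-updater on dev_public: DR = 0.946, FPR = 0.179.
benchmark_prevalence = 606 / 1119   # hallucinated share of dev_public
assumed_real_world = 0.02           # hypothetical prevalence, for illustration only

ppv_benchmark = ppv(0.946, 0.179, benchmark_prevalence)
ppv_real_world = ppv(0.946, 0.179, assumed_real_world)
print(f"PPV at benchmark prevalence: {ppv_benchmark:.2f}")
print(f"PPV at assumed 2% prevalence: {ppv_real_world:.2f}")
```

Under this assumed prevalence, even a tool with a low benchmark FPR flags mostly valid citations, because valid entries vastly outnumber hallucinated ones.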

See the paper for the full per-tier, per-type, and temporal-robustness analyses.

External Tool Baselines

HALLMARK also wraps several external citation verification tools as baselines:

Baseline	Tool	Databases	Install
HaRC	harcx	Semantic Scholar, DBLP, Google Scholar, Open Library	`pipx install harcx`
CiteVerifier	GhostCite	DBLP (local), Google Scholar, Google Search	Clone repo
hallucinator	hallucinator	CrossRef, arXiv, DBLP, Semantic Scholar, ACL Anthology, PubMed, OpenAlex	Clone repo
verify-citations	verify-citations	arXiv, ACL Anthology, Semantic Scholar, DBLP, Google Scholar, DuckDuckGo	`pipx install verify-citations`

LLM Baselines

Baseline	Model	Provider	API Key Env Var
`llm_openai`	GPT-5.1	OpenAI	`OPENAI_API_KEY`
`llm_anthropic`	Claude Sonnet 4.6	Anthropic	`ANTHROPIC_API_KEY`
`llm_openrouter_deepseek_r1`	DeepSeek R1	OpenRouter	`OPENROUTER_API_KEY`
`llm_openrouter_deepseek_v3`	DeepSeek V3.2	OpenRouter	`OPENROUTER_API_KEY`
`llm_openrouter_qwen`	Qwen 3 235B	OpenRouter	`OPENROUTER_API_KEY`
`llm_openrouter_mistral`	Mistral Large	OpenRouter	`OPENROUTER_API_KEY`
`llm_openrouter_gemini_flash`	Gemini 2.5 Flash	OpenRouter	`OPENROUTER_API_KEY`

# Use the baseline registry to discover and run any baseline
from hallmark.baselines.registry import list_baselines, check_available, run_baseline
from hallmark.dataset.loader import load_split

entries = load_split("dev_public")

# List all registered baselines (or just the free ones)
print(list_baselines(free_only=True))

# Check if a baseline's dependencies are installed
available, msg = check_available("harc")

# Run a baseline by name
predictions = run_baseline("harc", entries)

Python API

from hallmark.dataset.loader import load_split
from hallmark.evaluation.metrics import evaluate
from hallmark.dataset.schema import Prediction

# Load benchmark entries
entries = load_split("dev_public")

# Create predictions (your tool's output)
predictions = [
    Prediction(bibtex_key=e.bibtex_key, label="VALID", confidence=0.5)
    for e in entries
]

# Evaluate
result = evaluate(entries, predictions, tool_name="my-tool", split_name="dev_public")
print(f"F1: {result.f1_hallucination:.3f}")
print(f"Detection Rate: {result.detection_rate:.3f}")

Ranking

HALLMARK includes an ONEBench-inspired ranking system based on the Plackett-Luce model that handles incomplete evaluation data (not all tools evaluated on all entries):

from hallmark.evaluation.ranking import rank_tools_plackett_luce, rank_tools_mean_score

# Rank tools using Plackett-Luce (requires choix: uv pip install -e ".[ranking]")
pl_ranking = rank_tools_plackett_luce(entry_keys, tool_names, matrix)

# Fallback: simple mean-score ranking (no extra dependencies)
mean_ranking = rank_tools_mean_score(entry_keys, tool_names, matrix)
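
To illustrate what ranking under incomplete data means, here is a standalone sketch of the mean-score idea. It assumes a matrix with one row per entry and one column per tool, with None marking entries a tool was not evaluated on; the layout expected by rank_tools_mean_score may differ. This is not the Plackett-Luce model.

```python
def rank_by_mean_score(tool_names, matrix):
    """Rank tools by mean score over the entries each tool was evaluated on."""
    ranking = []
    for col, name in enumerate(tool_names):
        scores = [row[col] for row in matrix if row[col] is not None]
        if scores:  # skip tools with no evaluated entries
            ranking.append((name, sum(scores) / len(scores)))
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Toy data: 1 = correct, 0 = incorrect, None = not evaluated.
tool_names = ["tool_a", "tool_b", "tool_c"]
matrix = [
    [1, 0, None],
    [1, 1, 1],
    [0, None, 1],
    [1, 0, None],
]
ranking = rank_by_mean_score(tool_names, matrix)
```

Note that tool_c ranks first here despite being evaluated on only two entries, which is the kind of coverage bias a comparison-based model such as Plackett-Luce is meant to reduce.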

CI/CD

HALLMARK includes two GitHub Actions workflows:

tests.yml: Runs the full test suite across Python 3.10-3.13 on every push/PR
baselines.yml: Runs live free baselines (doi_only, verify_citations) weekly and on demand; harc and bibtexupdater use pre-computed result validation (checksum checks) instead of live re-execution due to API rate limiting

Contributing Entries

HALLMARK uses an ever-expanding pool inspired by ONEBench. To contribute new entries:

hallmark contribute --file my_entries.jsonl --contributor "Your Name"

See CONTRIBUTING.md for details on entry format, validation requirements, and the review process.

Project Structure

hallmark/
├── hallmark/                  # Python package
│   ├── dataset/               # Schema, loader, scraper, generator
│   ├── evaluation/            # Metrics, subtests, aggregator, temporal, ranking
│   ├── baselines/             # Registry + baselines (DOI-only, bibtex-updater, LLM×6, ensemble, HaRC, CiteVerifier, hallucinator, verify-citations)
│   │   └── registry.py        # Central baseline discovery, availability, dispatch
│   ├── contribution/          # Pool manager, entry validation
│   └── cli.py                 # Command-line interface
├── data/
│   ├── v1.0/                  # Benchmark splits (dev_public, test_public)
│   ├── hidden/                # Hidden test set (not public)
│   └── raw/                   # Raw scraped/generated entries
├── scripts/
│   └── run_all_baselines.py   # Batch orchestrator for baseline evaluation
├── .github/workflows/
│   ├── tests.yml              # CI: test suite across Python versions
│   └── baselines.yml          # CI: weekly free baseline evaluation
├── tests/                     # Test suite (562 tests)
├── figures/                   # Evaluation figures
└── examples/                  # Usage examples

Citation

If you use HALLMARK in your research, please cite:

@misc{hallmark2026,
    title={HALLMARK: A HALLucination benchMARK for Citation Verification},
    author={Reizinger, Patrik},
    year={2026},
    url={https://github.com/rpatrik96/hallmark}
}

License

MIT
