A comprehensive guide for designing YAML extraction templates in Hyper-Extract.
本指南中文版请查看 DESIGN_GUIDE_zh.md
- Quick Reference
- Part 1: Design Workflow
- Part 2: Type-Specific Design
- Part 3: Field Reference
- Part 4: Quality Assurance
- Part 5: Validation
Need relationships?
├─ No → Record types
│ ├─ Single object → model
│ ├─ Ordered list → list
│ └─ Deduplicated set → set
└─ Yes → Graph types
├─ Binary (A→B) → graph
└─ Multi-entity (A+B+C→D)
├─ Flat list → hypergraph (simple)
└─ Role groups → hypergraph (nested)
After graph:
├─ + time dimension → temporal_graph
├─ + space dimension → spatial_graph
└─ + both → spatio_temporal_graph
| Element | Convention | Example |
|---|---|---|
| Template name | CamelCase | EarningsSummary |
| Field names | snake_case | company_name |
| Relation type | type |
(not relation_type) |
| Time field | time |
(not event_date) |
| Tags | lowercase | finance, investor |
| Component | Max Fields | Priority |
|---|---|---|
| Entity fields | 5 | Essential → Important → Optional |
| Relation fields | 5 | Essential → Important → Optional |
| List fields | 3 | Essential → Important → Optional |
User Input → brainstorm → designer → optimizer → validator
↓ ↓ ↓ ↓
Type Selection YAML Gen Auto-fix Check
Clarify requirements and determine the extraction type.
Discussion Questions:
- What is the input source?
- What to extract?
- Key fields needed?
- Entity types and granularity?
- Relation types (predefined or custom)?
Output: Design draft with type, fields, and notes.
Generate YAML based on the design draft.
Output Template:
language: [zh, en]
name: TemplateName
type: [type]
tags: [domain]
description:
zh: '...'
en: '...'
output:
description: '...'
entities: # For graph types
description: '...'
fields: [...]
fields: # For record types
- name: field_name
type: str
description: '...'
guideline:
target: '...'
rules: [...] # For record types
rules_for_entities: [...] # For graph types
rules_for_relations: [...] # For graph types
identifiers: {}
display:
label: '...'Auto-fix common issues and apply best practices.
Auto-fix rules:
relation_type→typeevent_date→time- Mixed language → separated zh/en
Verify YAML correctness.
Validation checklist:
- language: zh/en
- name: CamelCase
- type: valid AutoType
- tags: lowercase array
- description: non-empty
- output: exists
- guideline: exists
Schema defines WHAT, Guideline defines HOW TO DO WELL.
| Schema Defines | Guideline Defines |
|---|---|
| Field names | Extraction strategy |
| Field types | Quality requirements |
| Field descriptions | Creation conditions |
| Required/optional | Common mistakes |
❌ Wrong: Guideline repeats field definitions ✅ Correct: Guideline explains how to extract well
Use when: Extracting a single record with multiple fields.
Template:
name: EarningsSummary
type: model
tags: [finance]
output:
description: '...'
fields:
- name: company_name
type: str
description: 'Company name'
required: true
- name: revenue
type: str
description: 'Revenue amount'
required: false
guideline:
target: 'You are a financial analyst...'
rules:
- 'Extract precise figures consistent with the source'
- 'Follow the format in the original text'
identifiers: {}
display:
label: '{company_name}'Design Checklist:
- All fields have clear semantic meaning?
- Field types are appropriate (str/int/float/list)?
- Required vs optional is reasonable?
- Default values are safe/meaningful?
- Display label references correct fields?
- Guideline does NOT repeat field definitions?
Use when: Extracting ordered items (rankings, sequences, bullet points).
Template:
name: KeywordList
type: list
tags: [general]
output:
description: '...'
fields:
- name: term
type: str
description: 'Keyword or term'
- name: rank
type: int
description: 'Order position'
required: false
guideline:
target: 'You are a keyword extraction expert...'
rules:
- 'Extract terms in the order they appear'
- 'Maintain ranking order if explicitly stated'
identifiers: {}
display:
label: '{term}'Design Checklist:
- Items have consistent structure?
- Order is meaningful and preserved?
- No redundant fields?
Use when: Extracting unique entities (entity registry, keyword list).
Template:
name: EntityRegistry
type: set
tags: [general]
output:
description: '...'
fields:
- name: name
type: str
description: 'Entity name'
- name: category
type: str
description: 'Entity type'
required: false
guideline:
target: 'You are an entity recognition expert...'
rules:
- 'Extract all unique entities'
- 'Maintain consistent naming across the text'
identifiers:
item_id: name
display:
label: '{name}'Design Checklist:
- item_id can uniquely identify records?
- Deduplication rules are clear?
- Similar entities are handled consistently?
Use when: Modeling relationships between two entities (A→B).
Template:
name: OwnershipGraph
type: graph
tags: [finance]
output:
description: '...'
entities:
description: 'Organization entities'
fields:
- name: name
type: str
description: 'Entity name'
- name: type
type: str
description: 'Entity type'
relations:
description: 'Ownership relationships'
fields:
- name: source
type: str
description: 'Owner entity'
- name: target
type: str
description: 'Owned entity'
- name: type
type: str
description: 'Relation type'
guideline:
target: 'You are a knowledge graph expert...'
rules_for_entities:
- 'Extract entities valuable for understanding the text'
- 'Maintain consistent naming throughout'
rules_for_relations:
- 'Create relations only when explicitly stated'
- 'Prefer relation words that appear in the text'
identifiers:
entity_id: name
relation_id: '{source}|{type}|{target}'
relation_members:
source: source
target: target
display:
entity_label: '{name} ({type})'
relation_label: '{type}'Design Checklist:
- Entity types cover key concepts?
- Entity granularity appropriate?
- Relation types semantically clear?
- source/target reference valid entities?
- No ambiguous relations?
- relation_label doesn't repeat source/target?
Use when: Modeling complex relationships with multiple participants.
Use when: All participants have equal roles.
relations:
fields:
- name: event_name
type: str
description: 'Event or mechanism name'
- name: participants
type: list
description: 'List of participating entities'
- name: type
type: str
description: 'Relation type'
- name: outcome
type: str
description: 'Result or conclusion'
required: false
identifiers:
entity_id: name
relation_id: '{event_name}|{type}'
relation_members: participants # STRINGUse when: Participants have distinct semantic roles.
relations:
fields:
- name: event_name
type: str
description: 'Event name'
- name: group_a
type: list
description: 'Group A participants (e.g., attackers)'
- name: group_b
type: list
description: 'Group B participants (e.g., defenders)'
- name: outcome
type: str
description: 'Result'
- name: reasoning
type: str
description: 'Explanation'
required: false
identifiers:
entity_id: name
relation_id: '{event_name}'
relation_members: [group_a, group_b] # LISTCommon Grouping Patterns:
| Scenario | Groups | Use Case |
|---|---|---|
| Formula | sovereigns, ministers, assistants, envoys | TCM formulas |
| Battle | attackers, defenders | Military conflicts |
| Transaction | buyers, sellers, intermediaries | Commercial deals |
| Contract | parties, witnesses | Legal documents |
Design Checklist:
- How many semantic groups?
- What are the group names?
- Should use nested grouping instead of role field?
- Participant count reasonable?
- Outcome defined?
Use when: Relationships have temporal aspects.
Add to graph:
relations:
fields:
- name: source
type: str
- name: target
type: str
- name: type
type: str
- name: time
type: str
description: 'When the relation occurred'
required: false
identifiers:
entity_id: name
relation_id: '{source}|{type}|{target}|{time}'
relation_members:
source: source
target: target
time_field: time
guideline:
rules_for_time:
- 'Observation time: {observation_time}'
- 'Absolute dates: Keep as-is (e.g., 2024-01-01)'
- 'Relative time: Convert to absolute'
- 'Fuzzy time: Leave empty, do not guess'Design Checklist:
- Time is edge property, not node property?
- Format handling rules clear?
- Relative time conversion specified?
Add to graph:
relations:
fields:
- name: source
type: str
- name: target
type: str
- name: type
type: str
- name: location
type: str
description: 'Where the relation occurred'
required: false
identifiers:
location_field: location
guideline:
rules_for_location:
- 'Observation location: {observation_location}'
- 'Structured: Keep as-is'
- 'Fuzzy: Use observation_location'Combines temporal_graph and spatial_graph.
identifiers:
time_field: time
location_field: locationentities:
description: 'Entity definitions'
fields:
- name: name
type: str
description: 'Entity name (unique identifier)'
- name: type
type: str
description: 'Entity type/category'
- name: description
type: str
description: 'Brief description'
required: falserelations:
description: 'Relation definitions'
fields:
- name: source
type: str
description: 'Source entity'
- name: target
type: str
description: 'Target entity'
- name: type
type: str
description: 'Relation type'identifiers:
entity_id: name
relation_id: '{source}|{type}|{target}'
relation_members:
source: source
target: targetidentifiers:
relation_members: participants # STRINGidentifiers:
relation_members: [group_a, group_b] # LIST| Type | entity_label | relation_label |
|---|---|---|
| graph | {name} ({type}) |
{type} |
| hypergraph | {name} |
{event_name} or {outcome} |
| temporal | {name} ({type}) |
{type}@{time} |
| spatio_temporal | {name} ({type}) |
{type}@{location}({time}) |
Length guidelines:
- entity_label: 5-20 characters
- relation_label: 10-30 characters
Core Principle: Each language field should use that language's own terminology.
| Field | Rule | Example |
|---|---|---|
zh |
Pure Chinese | 类型:实体 |
en |
Pure English | Type: entity |
Forbidden patterns:
- ❌
zhwith English terms:entity(实体) - ❌
enwith Chinese characters:类型
Translation patterns:
| Chinese | English |
|---|---|
| 实体 | entity |
| 抽象概念 | abstract |
| 过程 | process |
| 关系 | relation |
| 高/中/低 | high/medium/low |
Max fields per component: 5
| Priority | Description |
|---|---|
| Essential | source, target, participants |
| Important | type, time, location |
| Optional | description, metadata |
Simplification strategy: If >5 fields, consider:
- Splitting into multiple templates
- Moving optional fields to description
- Removing redundant fields
| Issue | Fix |
|---|---|
relation_type |
→ type |
event_date |
→ time |
entity_type |
→ type |
Mixed language in zh |
Extract entity(...) → 实体 |
Chinese chars in en |
Translate to English |
- language: zh/en
- name: CamelCase
- type: valid AutoType
- tags: lowercase array
- description: non-empty
- output: exists
- guideline: exists
- output.entities: exists
- output.relations: exists
- identifiers.entity_id: exists
- identifiers.relation_id: exists
- identifiers.relation_members: configured
- relation_members is string OR list
- If list: all fields are type: list
- identifiers.time_field: configured (temporal)
- identifiers.location_field: configured (spatial)
| Error | Fix |
|---|---|
| Missing required field | Add the field |
| Invalid type value | Use valid AutoType |
| Mixed language | Separate zh/en content |
| Field count >5 | Simplify or split |
Need single object → model
Need list → list
Need deduplication → set
Need binary relations → graph
Need multi-party relations → hypergraph
Need time → temporal_graph
Need location → spatial_graph
Need both → spatio_temporal_graph
templates/
├── presets/
│ ├── general/ # 13 templates (8 base + 5 domain-specific)
│ ├── finance/ # 5 templates
│ ├── medicine/ # 5 templates
│ ├── tcm/ # 5 templates
│ ├── industry/ # 5 templates
│ └── legal/ # 5 templates
├── DESIGN_GUIDE.md # This guide
├── DESIGN_GUIDE_zh.md # Chinese version
├── README.md # Template catalog
└── README_ZH.md # 中文目录
Part of the Hyper-Extract project. See root LICENSE.