A semi-automated annotation system for high-precision extraction and graph generation of causal relationships using form-meaning pairings.
In Construction Grammar, a Constructicon is the comprehensive inventory of constructions that a speaker learns, remembers, and uses to generate and understand language. Every construction in this inventory is a conventionally learned pairing of a surface form and an associated meaning. This project focuses on the set of constructions that relate causes and effects in natural language.
The primary goal of this project is to develop a high-precision, deterministic annotation system that identifies and extracts these causal constructions from text with a human in the loop. A manual annotation option lets users mark causal links that automatic matching missed and add new constructions to the inventory. The resulting annotations can then be transformed into a graph or used as training data for machine learning pipelines. As a case study, we use accident reports from the National Transportation Safety Board (NTSB), which are rich in causal links.
C++ Dependencies:
- C++17 Compiler: Use a compatible compiler (e.g., g++ 7+ or clang 5+).
- JSON Header: Make sure that json.hpp (nlohmann) is in the project directory.
Python Dependencies:
- Install the required Python packages for the CSV to RDF graph generator:
pip install pandas rdflib

# compile the constructicon and annotator
g++ -std=c++17 -o annotator constructicon-simple.cpp
# run the annotator
./annotator

The system initializes automatically when the program starts:
- Constructicon Initialization: Loads all causal constructions and regex patterns from `constructions.h` and `patterns.h`
- Annotator Initialization: Loads accident records from `cleaned_data.json`
- Static Initializers: The `ConstructiconInitializer` and `AnnotatorInitializer` objects run automatically at program start

- `cleaned_data.json` - sample text from NTSB accident reports
- `constructions.h` - 152 causal constructions
- `patterns.h` - 96 regex patterns for matching (omitting noisy triggers)
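The static-initializer behavior listed above relies on a general C++ idiom: the constructor of a namespace-scope object runs before `main`. The sketch below illustrates that idiom only. `DemoRegistry` and `DemoInitializer` are hypothetical stand-ins, not the project's `ConstructiconInitializer` or `AnnotatorInitializer`, and the two IDs are just sample values.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the data that the real initializers load.
struct DemoRegistry {
    // A function-local static avoids initialization-order problems
    // between translation units.
    static std::vector<std::string>& items() {
        static std::vector<std::string> data;
        return data;
    }
};

// The constructor of a namespace-scope object runs before main().
struct DemoInitializer {
    DemoInitializer() {
        DemoRegistry::items().push_back("C001");
        DemoRegistry::items().push_back("C146");
    }
};

// Defining one instance triggers the loading at program start.
static DemoInitializer demo_initializer;
```

By the time `main` begins, `DemoRegistry::items()` already holds both entries, so no explicit setup call is needed.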
The minimal_checker utility shows what data has been loaded and the number of records processed so far:
# compile the checker
g++ -std=c++17 -o minimal_checker minimal_checker.cpp constructicon-simple.cpp
# run the checker
./minimal_checker

- CausalConstruction
struct CausalConstruction {
std::string id; // unique ID for a construction, e.g. "C001", "C146"
CausalDegree degree; // Facilitate or Inhibit, depending on polarity
CausalOrder order; // CE (cause -> effect) or EC (effect -> cause)
std::string trigger_template; // trigger in context of the construction
std::string example; // example of the construction in context
};

- CausalPattern
struct CausalPattern {
std::string description; // human-readable description of the construction in context
std::regex pattern; // regular expression, e.g. std::regex(R"(\bis\s+why\b)", std::regex::icase)
std::vector<std::string> ids; // one or multiple Construction IDs, e.g. {"C146"}
ParseMethod parse_method; // parse method used for pattern matching: FullAuto, SemiAuto, or Manual
};

- Record
struct Record {
int recordID; // NTSB accident record ID, e.g. 193383
std::string probableCause; // NTSB probable cause statement following an investigation.
};

- AnnotationEntry
struct AnnotationEntry {
std::string constructionID; // Construction ID, e.g. "C146"
int recordID; // Record ID, e.g. 193383
std::string trigger; // word, phrase, or pattern that evokes a construction, e.g. "due to"
std::string cause; // span of text that encompasses the cause associated with the connector
std::string effect; // span of text that encompasses the effect associated with the connector
AnnotationStatus status; // verification status: Verified, Candidate, Rejected, or Unknown
ParseMethod parse_method; // annotator sets the parse method: FullAuto, SemiAuto, or Manual
};

These structures are held in the following containers:
std::initializer_list<CausalConstruction> constructions
std::initializer_list<CausalPattern> patterns
std::vector<Record> records
std::vector<AnnotationEntry> annotations

- Pattern matching - System finds potential causal connectors using regex patterns
- User validation - Review each match, label cause/effect spans
- Manual entry - Add connectors missed by automatic matching
- Progress tracking - Resume where you left off using `progress.txt`
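The pattern-matching step can be sketched as a scan of each probable-cause statement with every regex in the inventory. This is a minimal illustration, not the project's matcher: `DemoPattern`, `DemoMatch`, and `find_candidates` are hypothetical names, and `DemoPattern` is a simplified version of the `CausalPattern` struct shown earlier with `ParseMethod` left out.

```cpp
#include <regex>
#include <string>
#include <vector>

// Simplified stand-in for CausalPattern (ParseMethod omitted).
struct DemoPattern {
    std::string description;
    std::regex pattern;
    std::vector<std::string> ids;
};

// One candidate trigger found in a probable-cause statement.
struct DemoMatch {
    std::string trigger;           // matched text, e.g. "is why"
    std::size_t offset;            // character offset in the statement
    std::vector<std::string> ids;  // construction IDs of the pattern
};

// Scan the text with every pattern and collect each match as a
// candidate for user validation.
std::vector<DemoMatch> find_candidates(const std::string& text,
                                       const std::vector<DemoPattern>& patterns) {
    std::vector<DemoMatch> out;
    for (const auto& p : patterns) {
        auto begin = std::sregex_iterator(text.begin(), text.end(), p.pattern);
        for (auto it = begin; it != std::sregex_iterator(); ++it) {
            out.push_back({it->str(),
                           static_cast<std::size_t>(it->position()),
                           p.ids});
        }
    }
    return out;
}
```

Keeping the character offset with each match lets a later step highlight the trigger in place and split the surrounding text into cause and effect spans.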
During annotation in the terminal, triggers are highlighted with ANSI escape colors:
- 🔴 Red (Rejected): User-rejected match
- 🟡 Yellow (Candidate): Potential match found by pattern search
- 🟢 Green (Verified): Confirmed causal connector
When a match is found, it is highlighted in yellow to mark it as a candidate. If the user confirms the match, it turns green; if the user rejects it, it turns red. The colors make it easy to track progress visually within a record.
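A minimal sketch of this kind of highlighting is shown below, using the standard ANSI SGR foreground codes (31 red, 32 green, 33 yellow) and the reset code. `DemoStatus`, `ansi_color`, and `highlight` are hypothetical names chosen for illustration; the project's own enum and rendering code may differ.

```cpp
#include <string>

// Mirrors the status values named in the text; the real enum may differ.
enum class DemoStatus { Verified, Candidate, Rejected, Unknown };

// Map a status to its ANSI foreground color code.
const char* ansi_color(DemoStatus s) {
    switch (s) {
        case DemoStatus::Verified:  return "\033[32m";  // green
        case DemoStatus::Candidate: return "\033[33m";  // yellow
        case DemoStatus::Rejected:  return "\033[31m";  // red
        default:                    return "";          // no color
    }
}

// Wrap the first occurrence of `trigger` in `text` with the color for `s`,
// followed by the reset code. Returns `text` unchanged if the trigger is
// empty or absent, or if the status has no color.
std::string highlight(const std::string& text, const std::string& trigger,
                      DemoStatus s) {
    const std::string color = ansi_color(s);
    const std::size_t pos = text.find(trigger);
    if (pos == std::string::npos || color.empty() || trigger.empty()) {
        return text;
    }
    return text.substr(0, pos) + color + trigger + "\033[0m" +
           text.substr(pos + trigger.size());
}
```

Re-rendering the same record with a different status is enough to change a trigger from yellow to green or red.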
Choose one of three parsing methods when annotating causal connectors:
| Type | What it does | When to Use | Example |
|---|---|---|---|
| FullAuto | Automatically validates trigger as causal connector | Trigger always indicates causality | "because", "resulting in" |
| SemiAuto | Automatically flags candidate, prompts validation | Trigger typically indicates causality | "arises from", "promotes" |
| Manual | Never searches for trigger, requires manual entry | Trigger is noisy or polysemous | "for", "from", "when" |
Currently, parse methods can be set only during the manual annotation phase. In future versions, these settings will be stored in the pattern structs and applied during pattern matching.
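One way to read the table is as a mapping from parse method to the status a fresh match receives before any user input. The sketch below encodes that reading. It is an assumption about how the three modes could be wired up, not the project's code, and all names in it are hypothetical.

```cpp
// Mirror the enum values named in the text; the real definitions may differ.
enum class DemoParseMethod { FullAuto, SemiAuto, Manual };
enum class DemoStatus { Verified, Candidate, Rejected, Unknown };

// Whether the matcher should search for the trigger at all.
bool searches_automatically(DemoParseMethod m) {
    return m != DemoParseMethod::Manual;
}

// Status assigned to a fresh automatic match before any user input.
DemoStatus initial_status(DemoParseMethod m) {
    switch (m) {
        case DemoParseMethod::FullAuto:
            return DemoStatus::Verified;   // accepted without a prompt
        case DemoParseMethod::SemiAuto:
            return DemoStatus::Candidate;  // user must confirm or reject
        default:
            return DemoStatus::Unknown;    // Manual: nothing is matched automatically
    }
}
```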
- `annotations.csv` - Verified causal relationships (construction_id, record_id, trigger, cause, effect, status)
construction_id,record_id,trigger,cause,effect,status
C148,193383,"contributed to","pilot error","accident",Verified
TK,193384,"led to","mechanical failure","crash",Verified

- `causal_links.ttl` - RDF knowledge graph in Turtle format (generated via `csv_to_rdf.py`)
[] a :Causation ;
:cause "The failure of the alternate gear extension system" ;
:connector "prevented" ;
:effect "the landing gear from being lowered" ;
:constructionID "C039" ;
:recordID "193196" .

After annotating records, convert your annotations.csv to an RDF knowledge graph:
# run the converter
python csv_to_rdf.py

This generates a Turtle file causal_links.ttl containing:
- Reified causation statements (cause → effect relationships)
- Causal connectors used
- Construction IDs and record IDs for traceability
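The project's converter is the Python script `csv_to_rdf.py`. Purely to illustrate the shape of one reified statement, the C++ sketch below formats a single annotation in the layout of the Turtle example above. `turtle_escape` and `to_turtle` are hypothetical helpers, and prefix declarations are left out because the namespace is not shown here.

```cpp
#include <string>

// Escape the characters that cannot appear raw in a quoted Turtle literal.
std::string turtle_escape(const std::string& s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '\\': out += "\\\\"; break;
            case '"':  out += "\\\""; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            default:   out += c;      break;
        }
    }
    return out;
}

// Format one annotation as a blank-node :Causation statement.
std::string to_turtle(const std::string& cause, const std::string& connector,
                      const std::string& effect,
                      const std::string& construction_id,
                      const std::string& record_id) {
    return "[] a :Causation ;\n    :cause \"" + turtle_escape(cause) +
           "\" ;\n    :connector \"" + turtle_escape(connector) +
           "\" ;\n    :effect \"" + turtle_escape(effect) +
           "\" ;\n    :constructionID \"" + turtle_escape(construction_id) +
           "\" ;\n    :recordID \"" + turtle_escape(record_id) + "\" .\n";
}
```

Because each statement is a blank node carrying its own construction ID and record ID, every causal link can be traced back to the construction and the report it came from.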
├── constructicon-simple.h/.cpp   # Core library & annotation logic
├── constructions.h               # 152 causal construction definitions
├── patterns.h                    # 96 regex patterns for matching
├── json.hpp                      # JSON parsing library (nlohmann)
├── cleaned_data.json             # NTSB accident reports (input)
├── annotations.csv               # Verified causal relationships (output)
├── progress.txt                  # Session progress tracking
├── causal_links.ttl              # RDF knowledge graph (example generated output)
├── system_diagram_dark.png       # System workflow diagram (dark theme)
├── csv_to_rdf.py                 # CSV → RDF converter
├── minimal_checker.cpp           # Data loading verification utility
├── tests.cpp                     # Unit tests for core functionality
├── graphviz_example.dot          # Dot format graph visualization of causal chain
├── graphviz_example.png          # GraphViz visualization of causal chain sequence
└── README.md                     # This documentation file
- Shallow semantic parsing based on construction grammar
- Expedites and enhances human annotation tasks
- Three pattern identification modes: full auto, semi auto, manual
- Progress tracking so that annotation can resume across multiple sessions
- Color-coded highlighting for validation
- Annotations can be used as gold training data for machine learning
- CSV export to prepare for graph generation
- RDF knowledge graph generation
NTSB Reports → Pattern Matching → User Validation → CSV Output → RDF Knowledge Graph
The complete workflow, from raw data through annotation to RDF knowledge graph generation, is shown in system_diagram_dark.png.
An example of a causal chain extracted during annotation, converted into an RDF graph, and visualized in GraphViz is provided in graphviz_example.dot and graphviz_example.png.
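A causal chain of this kind can be written in DOT as one edge per cause-effect pair. The sketch below is an illustration only and is not how `graphviz_example.dot` was produced; `to_dot` and the graph name `causal_chain` are hypothetical.

```cpp
#include <string>
#include <utility>
#include <vector>

// Emit a GraphViz digraph with one edge per (cause, effect) pair.
// Double quotes inside labels are escaped so they remain valid DOT IDs.
std::string to_dot(const std::vector<std::pair<std::string, std::string>>& links) {
    auto quote = [](const std::string& s) {
        std::string out = "\"";
        for (char c : s) {
            if (c == '"') out += '\\';
            out += c;
        }
        return out + "\"";
    };
    std::string dot = "digraph causal_chain {\n";
    for (const auto& link : links) {
        dot += "  " + quote(link.first) + " -> " + quote(link.second) + ";\n";
    }
    return dot + "}\n";
}
```

When the effect of one link reappears verbatim as the cause of the next, GraphViz treats both as the same node, so consecutive links render as a connected chain.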
Allen, J. F. (1983). Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11), 832–843.
Cowan, R. (2008). The teacher’s grammar of English. Cambridge University Press.
Croft, W., Kalm, P., & Regan, M. (2022). Decomposing events and storylines. In Decompositional event structures in language, cognition, and computation (pp. 67–86). Cambridge University Press.
Croft, W., Pešková, P., & Regan, M. (2017). Integrating decompositional event structure into storylines. In T. Caselli, B. Miller, M. van Erp, & others (Eds.), Proceedings of the Workshop on Events and Stories in the News (pp. 98–109). Association for Computational Linguistics.
Dixon, R. (Ed.). (2006). Complementation: A cross-linguistic typology. Oxford University Press.
Dixon, R., & Aikhenvald, A. (Eds.). (2009). The semantics of clause linking: A cross-linguistic typology. Oxford University Press.
Dunietz, J. (2018). Annotating and automatically tagging constructions of causal language (Doctoral dissertation). Carnegie Mellon University, Pittsburgh, PA.
Dunietz, J. (n.d.). GitHub profile. Retrieved from https://github.com/duncanka
Dunietz, J. (n.d.). BECAUSE: The BECauSE Corpus and tools. GitHub repository. Retrieved from https://github.com/duncanka/BECAUSE
Dunietz, J. (n.d.). Causeway: A causality annotation and analysis toolkit. GitHub repository. Retrieved from https://github.com/duncanka/Causeway
Dunietz, J. (n.d.). LSTM Causality Tagger. GitHub repository. Retrieved from https://github.com/duncanka/lstm-causality-tagger
Dunietz, J., Levin, L., & Carbonell, J. (2017). The BECauSE Corpus 2.0: Annotating causality and overlapping relations. In LAW XI – The 11th Linguistic Annotation Workshop.
Dunietz, J., Levin, L., & Carbonell, J. (2018). DeepCx: A transition-based approach for shallow semantic parsing with complex constructional triggers. In Proceedings of EMNLP 2018.
Goldberg, A. E. (1995). Constructions: A construction grammar approach to argument structure. University of Chicago Press.
Gordon, A., & Hobbs, J. (2017). A formal theory of commonsense psychology: How people think people think. Cambridge University Press.
Hobbs, J. (2005). Toward a useful concept of causality for lexical semantics. Journal of Semantics, 22, 181–209. Oxford University Press.
Matthiessen, C., & Thompson, S. (1988). The structure of discourse and “subordination.” In J. Haiman & S. Thompson (Eds.), Clause combining in grammar and discourse (pp. 275–329). John Benjamins Publishing.
National Transportation Safety Board. (n.d.). NTSB accident reports. Retrieved from https://carol.ntsb.gov
Rezvani Kalajahi, S., Abdullah, A., Mukundan, J., & Tannacito, D. (2012). Discourse connectors: An overview of the history, definition and classification of the term. World Applied Sciences Journal, 19(11), 1659–1673.
Rezvani Kalajahi, S., Neufeld, S., & Abdullah, A. (2017). The discourse connector list: A multi-genre cross-cultural corpus analysis. Text & Talk, 37(3), 283–310.
Talmy, L. (1988). Force dynamics in language and cognition. Cognitive Science, 12(1), 49–100.
Talmy, L. (2000). Toward a cognitive semantics, Volume 1: Concept structuring systems. MIT Press.
This project incorporates data derived from publicly available National Transportation Safety Board (NTSB) accident reports.
It also builds on the comprehensive research program on causal constructions developed by Jesse Dunietz. Many of the reference constructions in constructions.h are adapted from Appendix C of his 2018 dissertation. His linked repositories, released under the MIT License, informed the methodology used here.

