Commit 83e5074

committed
merging TAG work
2 parents fc219d1 + 2b8af25 commit 83e5074

6 files changed

Lines changed: 90 additions & 9 deletions

schedule/week_2/tag.md

Lines changed: 90 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -1,22 +1,103 @@
11
# Text As Graph (TAG)
22

3-
The Text As Graph (TAG) model conceptualizes of documents as a *hypergraph* for text.
3+
The Text As Graph (TAG) model conceptualizes documents as a *hypergraph* for text. Since you may be unfamiliar with the hypergraph model, we briefly outline its key features. Like any *graph*, a hypergraph consists of nodes and edges, where the edges connect nodes to one another. A hypergraph contains both regular edges, which connect one node to one other node, and *hyperedges*, which connect a *set of nodes* to another *set of nodes*.
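The distinction between regular edges and hyperedges can be made concrete with a minimal Python sketch. The class and method names (`HyperEdge`, `Hypergraph`, `connect`, `is_regular`) are illustrative assumptions and are not taken from any TAG implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HyperEdge:
    """An edge joining one set of nodes to another set of nodes."""
    sources: frozenset
    targets: frozenset

    def is_regular(self) -> bool:
        # A regular edge is the special case with exactly one node on each side.
        return len(self.sources) == 1 and len(self.targets) == 1


@dataclass
class Hypergraph:
    """Hypothetical container: a set of nodes plus a list of (hyper)edges."""
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def connect(self, sources, targets) -> HyperEdge:
        edge = HyperEdge(frozenset(sources), frozenset(targets))
        # Any node mentioned by an edge becomes part of the graph.
        self.nodes |= edge.sources | edge.targets
        self.edges.append(edge)
        return edge


graph = Hypergraph()
regular = graph.connect({"a"}, {"b"})             # one node to one node
hyper = graph.connect({"a", "b"}, {"c", "d", "e"})  # set of nodes to set of nodes
print(regular.is_regular(), hyper.is_regular())
```

Treating a regular edge as a hyperedge with one node on each side keeps the sketch to a single edge type.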
44

5-
As you may be unfamiliar with a hypergraph model, we will briefly outline its key points. A *graph* consists of nodes and edges, where the edges connect one node to another. A hypergraph contains both regular edges, which connect one node to one other node, and *hyperedges*, which connect a *set of nodes* to another *set of nodes*. For more information about modeling text as a hypergraph see [“It’s more than just overlap: Text As Graph”](https://www.balisage.net/Proceedings/vol19/html/Dekker01/BalisageVol19-Dekker01.html) <!--- add reference to 2018 Balisage paper here the moment we got it -->
5+
You can find more information about modeling text as a hypergraph in [“It’s more than just overlap: Text As Graph”](https://www.balisage.net/Proceedings/vol19/html/Dekker01/BalisageVol19-Dekker01.html). <!--- add reference to 2018 Balisage paper here the moment we got it --> For now, keep in mind that the hypergraph is a powerful data structure that lets you represent more textual information, and represent it in a more inclusive and refined way.
66

77
## Why are we looking at TAG?
88

9-
The only document model in wide use in digital editions projects is XML, which is the only technology that has sufficient maturity and a sufficiently large community to be practical for general production purposes. The reason we nonetheless introduce TAG (and [LMNL](lmnl_syntax.md)) is that looking at non-XML ways of modeling documents encourages developers to think first about the model, and then about the relationship of the model to the syntax. In other words, thinking about how to model documents in LMNL and TAG can improve the quality of our models, whether we use XML or an alternative.
9+
The only document model in wide use in digital edition projects is XML, which is the only technology with sufficient maturity and a sufficiently large community to be practical for general production purposes. The reason we nonetheless introduce TAG (and [LMNL](lmnl_syntax.md)) is that looking at non-XML ways of modeling documents encourages us to think first about the model, and then about the relationship of the model to the syntax. In other words, thinking about how to model documents in LMNL and TAG can improve the quality of our models, whether we use XML or an alternative.
1010

11-
This tutorial is based on the first version of TAG, which has undergone several changes since it was introduced (e.g., directed Text-to-Text edges have been replaced by undirected ones; TAGML serves as a markup language for TAG, etc.). The differences are not important for the purpose of using TAG to encourage critical thinking about document modeling, so we have kept the original description, even though it has now been superseded in some respects. Up-to-date information about TAG is maintained at the [TAG portal on GitHub](https://github.com/HuygensING/tag).
11+
A study of TAG's features therefore serves the purpose of encouraging critical thinking about document modeling. The TAG model is under active development, so the following paragraphs focus on the properties of the data model rather than on its query language or schema; the syntax is introduced only briefly at the end.
1212

13-
## TAG counterparts to XML `text()` nodes
13+
## TAG edges
14+
The TAG data model distinguishes a number of different edges; below we describe just the two main ones.
1415

15-
The text in a TAG document is a sequence of Text nodes, where the sequence begins with a Document node. The simplest TAG document, which contains only text and no markup, looks something like:
16+
### TAG undirected edges
17+
All edges in TAG's hypergraph are undirected. The graph models you may be more familiar with, such as a variant graph, have directed edges: a directed edge from node A to node B can only be traversed from A to B. Undirected edges, conversely, can be traversed in both directions.
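The difference in traversal can be illustrated with a small Python sketch that builds an adjacency map from the same edge list twice, once as directed and once as undirected. The function name `neighbours` and the node labels are assumptions made for illustration only.

```python
from collections import defaultdict


def neighbours(edges, directed):
    """Build an adjacency map; undirected edges are registered in both directions."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        if not directed:
            adjacency[b].add(a)
    return adjacency


edges = [("A", "B")]
directed = neighbours(edges, directed=True)
undirected = neighbours(edges, directed=False)

# In the directed reading B has no way back to A; in the undirected reading it does.
print(sorted(directed.get("B", set())), sorted(undirected.get("B", set())))
```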
1618

17-
![](images/tag_no_markup.png)
19+
### TAG hyperedges
20+
TAG uses hyperedges to associate markup with its textual content. A hyperedge can connect a whole set of nodes, in contrast to a regular edge, which connects exactly one node to one other node. This means that a single Markup node can be connected to several Text nodes, and that several Markup nodes can be attached to the same Text node. An example of a hyperedge is given below.
1821

19-
## TAG Markup-to-Text hyperedges
22+
## TAG Nodes
23+
24+
The TAG model distinguishes four kinds of nodes in the hypergraph. They are briefly described below and illustrated using a simple example.
25+
26+
### TAG Document node
27+
The Document node represents a single TAG document. It marks the start of a sequence of Text nodes and serves as a root node. See the root node in the image below.
28+
29+
### TAG Text nodes
30+
31+
A Text node represents (a part of) the textual content of the document. Whitespace is included in the textual content. The simplest TAG document, which contains only text and no markup, looks something like:
32+
33+
![](images/tag_no_markup_update.png)
34+
35+
### TAG Markup nodes
36+
Markup nodes store the name of the markup. They are connected to one or more Text nodes by a hyperedge. In the figure below, the hyperedge connects the Markup node `verb` with the Text node containing `est`.
37+
38+
![](images/tag_markup_update.png)
39+
40+
### TAG Annotation nodes
41+
Annotations in TAG are comparable to XML attributes. Information is stored as a key:value pair, so an Annotation node has two properties: the name of the annotation (the key) and its content (the value).
42+
43+
Below is an illustration of an annotation with the key `POS` and the value `fin` on the Markup node:
44+
45+
![](images/tag_annotation_update.png)
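The four kinds of node described in this section can be sketched as plain Python data classes. This is a rough approximation under stated assumptions: the class and field names are invented for illustration, the hyperedge is reduced to a list of Text nodes held by the Markup node, and the text surrounding `est` is made up. Only the `verb` markup, the `est` Text node, and the `POS`/`fin` annotation come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class TextNode:
    content: str  # whitespace is part of the textual content


@dataclass
class AnnotationNode:
    key: str    # name of the annotation
    value: str  # value of the annotation


@dataclass
class MarkupNode:
    name: str
    # Stand-in for the hyperedge: one Markup node, one or more Text nodes.
    text_nodes: list = field(default_factory=list)
    annotations: list = field(default_factory=list)


@dataclass
class DocumentNode:
    # The Document node is the root and marks the start of the Text node sequence.
    text_nodes: list = field(default_factory=list)

    def text(self) -> str:
        return "".join(node.content for node in self.text_nodes)


est = TextNode("est")
# "Hoc " and " exemplum." are hypothetical filler, not taken from the figures.
doc = DocumentNode([TextNode("Hoc "), est, TextNode(" exemplum.")])
verb = MarkupNode("verb", text_nodes=[est],
                  annotations=[AnnotationNode("POS", "fin")])

print(doc.text())
print(verb.name, [t.content for t in verb.text_nodes],
      {a.key: a.value for a in verb.annotations})
```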
46+
47+
48+
## Modeling text in TAG
49+
We mentioned above that the properties of the hypergraph data model for text cater for the modeling of complex textual features. In other words: what is hard in XML is not hard in TAG.
50+
51+
Let's take a look at the textual examples we used when illustrating [the limitations of XML](https://github.com/Pittsburgh-NEH-Institute/Institute-Materials-2017/blob/master/schedule/week_2/xml_limitations.md) and see how they translate to TAG.
52+
53+
### Overlap
54+
55+
Consider a fragment of Percy Bysshe Shelley’s “Ozymandias” (1818):
56+
57+
58+
> Who said—“Two vast and trunkless legs of stone
59+
> Stand in the desart ...
60+
61+
What in an XML transcription leads to overlapping structures, and thus to XML that is not well formed:
62+
63+
```xml
64+
<line><phrase>Who said —</phrase> <phrase>“Two vast and trunkless legs of stone</line>
65+
<line>Stand in the desart….</phrase></line>
66+
```
67+
68+
is easily expressed in TAG:
69+
70+
![](images/tag_overlap_update.png)
71+
72+
The phrase “Two vast and trunkless legs of stone stand in the desart” is split between two lines, each of which also contains other phrases. There is no valid way to mark this up in XML except by prioritizing one hierarchy (phrases or lines) and representing the other with empty milestones. In TAG, however, neither hierarchy is primary; phrases and lines both contain Text nodes, and both types of relationships are encoded in the same way. (See also a [complete graphic representation of “Ozymandias”](images/ozymandias_hypergraph.svg), generated by Alexandria.)
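As a rough illustration of why neither hierarchy needs to be primary, the fragment can be sketched in Python with each piece of markup recorded as a name plus the set of Text nodes it covers. The segmentation into three Text nodes, the tuple layout, and the helper names `overlaps` and `text_of` are assumptions made for this sketch and do not reproduce Alexandria's internal representation.

```python
# Text nodes of the fragment, in document order (segmentation is an assumption).
# \u2014 is the dash after "said", \u201c the opening quotation mark.
text_nodes = [
    "Who said\u2014",
    "\u201cTwo vast and trunkless legs of stone\n",
    "Stand in the desart ...",
]

# Each markup entry: (name, indices of the Text nodes its hyperedge connects).
markup = [
    ("line", {0, 1}),
    ("line", {2}),
    ("phrase", {0}),
    ("phrase", {1, 2}),
]


def overlaps(a, b):
    """True when two ranges share Text nodes but neither contains the other."""
    return bool(a & b) and not a <= b and not b <= a


def text_of(indices):
    return "".join(text_nodes[i] for i in sorted(indices))


first_line = markup[0][1]
second_phrase = markup[3][1]
print(overlaps(first_line, second_phrase))
print(text_of(second_phrase))
```

Lines and phrases are stored in exactly the same way, so the overlap between the first line and the second phrase needs no milestones or other workaround.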
73+
74+
### Discontinuity
75+
76+
```xml
77+
<q>"and what is the use of a book,"</q> thought Alice <q>"without pictures or conversation?"</q>
78+
```
79+
80+
This discontinuous quotation, which XML has to split into two separate `<q>` elements, can be modeled in TAG:
81+
82+
![](images/tag_discontinuity_update.png)
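One plausible reading of this model, sketched in Python, is a single `q` Markup node whose hyperedge connects two Text nodes that are not adjacent in the text. The dictionary layout and the helper `is_discontinuous` are assumptions made for illustration; the figure remains the authoritative representation.

```python
text_nodes = [
    '"and what is the use of a book,"',
    " thought Alice ",
    '"without pictures or conversation?"',
]

# A single Markup node; its hyperedge skips the narrator's interruption.
q = {"name": "q", "text_nodes": [0, 2]}


def is_discontinuous(indices):
    """True when the covered Text nodes are not all adjacent in document order."""
    ordered = sorted(indices)
    return any(b - a > 1 for a, b in zip(ordered, ordered[1:]))


quoted = " ".join(text_nodes[i] for i in q["text_nodes"])
print(is_discontinuous(q["text_nodes"]))
print(quoted)
```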
83+
84+
## TAG syntax
85+
TAGML stands for _TAG Markup Language_; as a syntax, it is a serialization of the TAG model. It is designed to represent all features of a text in a straightforward manner.
86+
87+
A simple TAGML example is:
88+
89+
```
90+
[line>The rain in Spain falls mainly on the plain.<line]
91+
```
92+
93+
Here `[line>` is the start-tag and `<line]` is the end-tag. Every start-tag should have a corresponding end-tag, and vice versa.
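The pairing rule can be checked mechanically. The Python sketch below handles only a simplified subset of TAGML, assumed here to consist of bare `[name>` start-tags and `<name]` end-tags; annotations, escaping, and every other TAGML feature are ignored, and the function name `unmatched_tags` is hypothetical. Because TAG permits overlap, the check counts tags per name and does not require them to nest.

```python
import re
from collections import Counter

# Simplified assumption: a tag is "[name>" (start) or "<name]" (end), nothing more.
TAG = re.compile(r"\[(?P<open>\w+)>|<(?P<close>\w+)\]")


def unmatched_tags(tagml):
    """Return a list of problems: end-tags without a start-tag and vice versa."""
    open_counts = Counter()
    problems = []
    for match in TAG.finditer(tagml):
        if match.group("open"):
            open_counts[match.group("open")] += 1
        else:
            name = match.group("close")
            if open_counts[name] == 0:
                problems.append(f"end-tag <{name}] has no start-tag")
            else:
                open_counts[name] -= 1
    # Whatever is still open at the end of the input was never closed.
    for name, count in open_counts.items():
        problems.extend([f"start-tag [{name}> has no end-tag"] * count)
    return problems


print(unmatched_tags("[line>The rain in Spain falls mainly on the plain.<line]"))
print(unmatched_tags("[line>The rain in Spain"))
```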
94+
95+
## Curious about TAG?
96+
97+
Up-to-date information about TAG is maintained at the [TAG portal on GitHub](https://github.com/HuygensING/tag). Also take a look at the [Balisage 2018 paper](https://www.balisage.net/Proceedings/vol21/html/HaentjensDekker01/BalisageVol21-HaentjensDekker01.html) for more details on TAGML and its relation to existing markup languages.
98+
99+
100+
<!--- ### TAG hyperedges
20101
21102
TAG is a data model that does not (yet) have its own markup language, but the [Alexandria](../week_3/alexandria.md) implementation of TAG is capable of importing documents that have been marked up using LMNL sawtooth syntax. In this context, the sawtooth syntax is used to represent parts of the TAG hypergraph model, rather than the LMNL range model. The fact that the same syntax can be used to represent features of two data models highlights the difference between the data model and the syntax.
22103
@@ -64,4 +145,4 @@ In this example, Markup nodes of type “phrase” and their associated hyperedg
64145
65146
## Curious about TAG?
66147
67-
As noted above, up-to-date information about TAG is maintained at the [TAG portal on GitHub](https://github.com/HuygensING/tag).
148+
As noted above, up-to-date information about TAG is maintained at the [TAG portal on GitHub](https://github.com/HuygensING/tag). -->
