Commit 2ae88d8

Updated Danish tagger
1 parent b5e184c commit 2ae88d8

2 files changed

Lines changed: 30 additions & 34 deletions

docs/docs/usage/getting_started/intro.md

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ PyMUSAS currently support 11 different languages for the rule based tagger with
 | Language (BCP 47 language code) | MWE Support | Disk Space (MB) |
 | --- | --- | --- |
 | Mandarin Chinese (cmn) | :heavy_check_mark: | 1.28 |
-| Danish (da) | :heavy_check_mark: | 0.85 |
+| Danish (da) | :heavy_check_mark: | 0.82 |
 | Dutch, Flemish (nl) | :x: | 0.15 |
 | English (en) | :heavy_check_mark: | 0.86 |
 | Finnish (fi) | :x: | 0.64 |

docs/docs/usage/how_to/tag_text_with/rule_based_tagger.md

Lines changed: 29 additions & 33 deletions
@@ -125,19 +125,15 @@ Text POS MWE start and end index USAS Tags
 <details>
 <summary>Expand</summary>

-First download both the [Danish PyMUSAS `RuleBasedTagger` spaCy component](https://github.com/UCREL/pymusas-models/releases/tag/da_dual_none_contextual_none-0.4.0) and the [small Danish spaCy model](https://spacy.io/models/da):
+First download both the [Danish PyMUSAS `RuleBasedTagger` spaCy component](https://github.com/UCREL/pymusas-models/releases/tag/da_dual_none_contextual_none-0.4.1) and the [small Danish spaCy model](https://spacy.io/models/da):

 ``` bash
-pip install https://github.com/UCREL/pymusas-models/releases/download/da_dual_none_contextual_none-0.4.0/da_dual_none_contextual_none-0.4.0-py3-none-any.whl
+pip install https://github.com/UCREL/pymusas-models/releases/download/da_dual_none_contextual_none-0.4.1/da_dual_none_contextual_none-0.4.1-py3-none-any.whl
 python -m spacy download da_core_news_sm
 ```

 Then create the tagger, in a Python script:

-:::note
-Currently, there is no lemmatization component in the spaCy pipeline for Chinese.
-:::
-
 ``` python
 import spacy

@@ -156,55 +152,55 @@ text = "Mindst 65% af Nilens vand kommer fra Den Blå Nil, som udspringer ved Ta

 output_doc = nlp(text)

-print(f'Text\tPOS\tUSAS Tags')
+print(f'{"Text":<20}{"Lemma":<20}{"POS":<8}USAS Tags')
 for token in output_doc:
-    print(f'{token.text}\t{token.pos_}\t{token._.pymusas_tags}')
+    print(f'{token.text:<20}{token.lemma_:<20}{token.pos_:<8}{token._.pymusas_tags}')
 ```

 <details>
 <summary>Output:</summary>

 ``` tsv
-Text POS USAS Tags
-Mindst ADV ['A13.7']
-65 NUM ['N1']
-% SYM ['Z99']
-af ADP ['Z5', 'E2-']
-Nilens PROPN ['Z2']
-vand NOUN ['O1.2', 'W3/M4', 'B1', 'C1%']
-kommer VERB ['K2']
-fra ADP ['K2']
-Den DET ['Z99']
-Blå ADJ ['O4.3', 'S1.2.4-', 'G1.2']
-Nil NOUN ['Z99']
-, PUNCT ['PUNCT']
-som PRON ['A13']
-udspringer VERB ['Z99']
-ved ADP ['Z5']
-Tanasøen PROPN ['Z99']
-i ADP ['Z5']
-Etiopien PROPN ['Z2']
-. PUNCT ['PUNCT']
+Text Lemma POS USAS Tags
+Mindst mindst ADV ['A13.7']
+65 65 NUM ['N1']
+% % SYM ['Z99']
+af af ADP ['Z5']
+Nilens Nilen PROPN ['Z2']
+vand vand NOUN ['O1.2', 'W3/M4', 'B1', 'C1%']
+kommer komme VERB ['K2']
+fra fra ADP ['K2']
+Den den DET ['Z99']
+Blå Blå ADJ ['O4.3', 'S1.2.4-', 'G1.2']
+Nil Nil NOUN ['Z99']
+, , PUNCT ['PUNCT']
+som som PRON ['A13']
+udspringer udspringe VERB ['Z99']
+ved ved ADP ['Z5']
+Tanasøen Tanasøen PROPN ['Z99']
+i i ADP ['Z5']
+Etiopien Etiopien PROPN ['Z2']
+. . PUNCT ['PUNCT']
 ```

 </details>

 For Danish the tagger also identifies and tags Multi-Word Expressions (MWE), to find these MWE's you can run the following:

 ``` python
-print(f'Text\tPOS\tMWE start and end index\tUSAS Tags')
+print(f'{"Text":<20}{"POS":<8}{"MWE start and end index":<30}{"USAS Tags"}')
 for token in output_doc:
     start, end = token._.pymusas_mwe_indexes[0]
     if (end - start) > 1:
-        print(f'{token.text}\t{token.pos_}\t{(start, end)}\t{token._.pymusas_tags}')
+        print(f'{token.text:<20}{token.tag_:<8}{str((start, end)):<30}{token._.pymusas_tags}')
 ```

 Which will output the following:

 ``` tsv
-Text POS MWE start and end index USAS Tags
-kommer VERB (6, 8) ['K2']
-fra ADP (6, 8) ['K2']
+Text POS MWE start and end index USAS Tags
+kommer VERB (6, 8) ['K2']
+fra ADP (6, 8) ['K2']
 ```

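The change from tab-separated output to fixed-width f-string columns, together with the `(end - start) > 1` MWE filter, can be tried without installing spaCy or PyMUSAS. The following is a minimal standalone sketch, not the documented pipeline: the `rows` tuples and the helpers `format_row` and `mwe_rows` are hypothetical stand-ins for the token attributes used in the diff (`token.text`, `token.pos_`, `token._.pymusas_mwe_indexes[0]`, `token._.pymusas_tags`), and the span for `vand` is assumed from its position in the example sentence.

```python
# Hypothetical rows standing in for spaCy tokens:
# (text, pos, (mwe_start, mwe_end), usas_tags). Not produced by PyMUSAS here.
rows = [
    ("vand", "NOUN", (5, 6), ["O1.2", "W3/M4", "B1", "C1%"]),
    ("kommer", "VERB", (6, 8), ["K2"]),
    ("fra", "ADP", (6, 8), ["K2"]),
]


def format_row(text, pos, span, tags):
    """Left-align each column to a fixed width, mirroring the widths in the diff."""
    return f"{text:<20}{pos:<8}{str(span):<30}{tags}"


def mwe_rows(rows):
    """Keep only rows whose MWE span covers more than one token."""
    return [row for row in rows if (row[2][1] - row[2][0]) > 1]


header = f'{"Text":<20}{"POS":<8}{"MWE start and end index":<30}USAS Tags'
print(header)
for row in mwe_rows(rows):
    print(format_row(*row))
```

Left-aligned fixed widths keep the columns lined up regardless of token length, whereas tab-separated columns drift as soon as one token is longer than a tab stop. The tuple span is converted with `str(...)` before padding because tuples do not accept a `<30` format spec directly.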