potools/vignettes/developers.Rmd at d97f7ecd2814c1eef23d8706e3a527ca08f9d202 · MichaelChirico/potools · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
---
title: "Translation for package developers"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Translation for package developers}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(glue)
```

In this vignette you'll learn how to set up your package for translation, focusing on translating messages in your R code.
It's aimed at a package developers; if you're translating an existing package, you might want to start with `vignette("translators")`.

```{r setup}
library(potools)
```

## Basic process

Before we get into the details lets review the basic process:

-   You run `po_extract()` to extract all translatable messages from your R and C code.
    This creates a `.pot` (po template) file that contains every translatable message.

-   A translator calls `po_create()` to generate a `.po` file for their language.
    A `.po` file consists of pair of lines like:

        msgid "This is the message in English"
        msgstr ""

    They then replace each `msgstr` line with the appropriate translation:

        msgid "This is the message in English"
        msgstr "This is the message in another language"

-   Either you or the translator uses `po_compile()` to turn the plain text `.po` files into binary `.mo` files that are distributed with your package.

## Extraction

potools provides two styles for extracting messages for translation: `base` and `explicit`, as described below.
Once you've decided which style you want to use, record it in the `DESCRIPTION`:

    Config/potools/style: explicit

And then run `po_extract()` to generate the `.pot` file.

### Base style

The base style captures messages the base functions that include built-in translation capabilities[^1]: `message()`, `warning()`, and `stop()`. It also captures messages from the explicit translation functions `gettext()`, `gettextf()`, and `ngettext().` (It does **not**, however, not translate `cat()`).
The advantage of the base style, is that it's very quick to get started with.

[^1]: You can tell they have translation capabilities because they include the `domain` argument; behind the scenes they all call `gettext()`.

`message()`, `warning()` and `stop()` concatenate the components of `…`:

```{r, error = TRUE}
message("This", " is", " a", " message")
warning("This", " is", " a", " warning")
stop("This", " is", " an", " error")
```

However, as you'll learn shortly, this style is unlikely to generate messages that are easily translated, so `po_extract(style = "base")` will also capture messages from `messagef()`, `warningf()`, and `stopf`().
These are equivalents to `message()`, `warnings()`, and `stop()` that use `sprintf()` style (hence the `f` suffix).

These functions are not included in base R, so if you want to use them, you'll need to copy the definitions from below:

```{r}
messagef <- function(fmt, ..., appendLF = TRUE) {
  msg <- gettextf(fmt, ..., domain = "R-{mypackage}")
  message(msg, domain = NA, appendLF = appendLF)
}

warningf <- function(fmt, ..., immediate. = FALSE, noBreaks. = FALSE) {
  msg <- gettextf(fmt, ..., domain = "R-{mypackage}")
  warning(msg,
    domain = NA,
    call. = FALSE,
    immediate. = immediate.,
    noBreaks. = noBreaks.
  )
}

stopf <- function(fmt, ...) {
  msg <- gettextf(fmt, ..., domain = "R-{mypackage}")
  stop(msg, domain = NA, call. = FALSE)
}
```

### Explicit style

The explicit style only captures messages that explicitly flagged for translation by `gettext()`, `ngettext()`, or `tr_()`.
Like `messagef()` and friends, `tr_()` is not provided by R, so you'll need to define it yourself:

```{r}
tr_ <- function(...) {
  enc2utf8(gettext(paste0(...), domain = "R-{mypackage}"))
}
```

The advantage of the explicit style is that it's very clear which messages are ready for translated.
The disadvantage is that it's easy to miss string, so in the future we'll provide automated ways to identify strings that haven't been translated and probably should be.

I'll use the explicit style in the rest of this vignette because it makes it very clear what is being translated.

## Writing good messages

The mechanics of translating your package are quite straightforward.
The bigger challenge is writing messages that are easy to translate.
In part, this is an extension of writing messages that are easy to understand in English as well!
And if it's hard for a native English speaker to understand your message, it's going to be even harder once it's translated into another language.
The following sections give some advice about how to write good messages, as inspired by the "[Preparing translatable strings](https://www.gnu.org/software/gettext/manual/html_node/Preparing-Strings.html#Preparing-Strings%20(Inspired%20by%20from%20))" section of the gettext[^2] manual.

One important point is to avoid templating out sentence fragments, especially words. Put message pieces into a template if and only if they are untranslatable (e.g., proper nouns, function names/R code).

For example, it is bad practice to conditionally template a word or phrase into a message:

```{r}
# BAD
in_parallel <- TRUE
run_type <- if (in_parallel) "in parallel" else "sequentially"
sprintf(tr_("Running job %s."), run_type)
```

The translator only sees `"Running job %s."`, and cannot adapt the translation for the two cases. It's much better to use two different, complete messages:

```{r}
# GOOD
in_parallel <- TRUE
if (in_parallel) {
  tr_("Running job in parallel.")
} else {
  tr_("Running job sequentially.")
}
```

This gives the translator full context. Why is this important? Because different words can require entirely different sentence structures in other languages. For example, in Polish, "to run [a job] sequentially" can be expressed with a specific verb, `szeregować`. A natural translation of "Running job sequentially." might be "Szeregowanie zadania.". For "Running job in parallel.", one would likely use a different verb and structure, like "Uruchamianie zadania równolegle.".

The template in the "BAD" example is too rigid. It forces the translator to find a single translation for "Running job %s." that works with both "sequentially" and "in parallel", which is unnatural and may not even be possible. The "GOOD" example, by providing two distinct sentences, allows the translator to pick the most idiomatic phrasing for each case.

[^2]: gettext is the underlying library that powers R's translation system.

### Write full sentences

Generally, you should strive to make sure that each message comes from a single string (i.e. lives within a single "").
Take this simple greeting where I translate "good" and "morning" individually:

```{r}
name <- "Hadley"
paste0(tr_("Good"), " ", tr_("morning"), " ", name, "!")
```

This will pose two challenges for translators:

-   When working with `.po` files, translators see each individual string without context, and they may be in a different order to the original source.
    This can lead either to a poor translation or an expensive journey to the source code to get more context.

        msgid "morning"
        msgstr ""

        msgid "Good"
        msgstr ""

-   Prose is not like code: you can't reliably build up sentences from small fragments of text.
    Even if you can figure out how to do it in English, it's unlikely the same form will work for other languages.

Instead it's better to generate the complete message in a single string using `glue()` or `sprintf()` [^3] to interpolate in the parts that vary:

[^3]: If you're using the "base" style, you could instead write `gettextf("Good morning %s", name)`; `gettextf()` is a version of `sprintf()` that translates the first argument.

```{r}
glue(tr_("Good morning {name}"))
sprintf(tr_("Good morning %s"), name)
```

Then the translator sees something like this:

    msgid "Good morning {name}!"
    msgstr ""

This gives the translator enough context to create a good translation and the freedom to change word order to make a grammatically correct sentence in their language.
We can make the problem more challenging by making our greeting more flexible:

```{r}
greet <- function(name, time_of_day) {
  paste0(tr_("Good"), " ", time_of_day, " ", name, "!")
}
greet("Hadley", tr_("morning"))
greet("Hadley", tr_("afternoon"))
greet("Hadley", tr_("evening"))
```

This would generate the following sequence of translations for French:

    msgid "Good"
    msgstr "Bon"

    msgid "morning"
    msgstr "matin"

    msgid "afternoon"
    msgstr "après midi"

    msgid "evening"
    msgstr "soirée"

Unfortunately this breakdown won't generate correct French.
The three greetings should be "Bonjour" for morning and afternoon, and "Bonsoir" for evening.
There are two problems: good morning and good afternoon both use bonjour (even though French has different words for morning and afternoon; bon après-midi is used as a farewell), and the two word English phrases turn into single French words.

If you were translating to [Mongolian](https://twitter.com/khorloobatpurev/status/1458044704130437124) you'd face a different problem.
While Mongolian uses the same times of day, it arranges the words _in the opposite order_ to English: "Өглөөний мэнд" is morning greetings, "Өдрийн мэнд" is afternoon greetings, and "Оройн мэнд" is evening greetings.

Again, we need to resolve this problem by moving away from translating fragments and towards translating complete sentences.
One way to do that here would be to restrict ourselves to a fixed set of time points and use `switch()` to specify the greeting:

```{r}
greet <- function(name, time_of_day) {
  switch(time_of_day,
    morning = glue(tr_("Good morning {name}!")),
    afternoon = glue(tr_("Good afternoon {name}!")),
    evening = glue(tr_("Good evening {name}!"))
  )
}
```

This works for French (and Mongolian):

    msgstr: "Good morning {name}!"
    msgid: "Bonjour {name}!"

    msgstr: "Good afternoon {name}!"
    msgid: "Bonjour {name}!"

    msgstr: "Good evening {name}!"
    msgid: "Bonsoir {name}!"

However, it's still not a fully general solution as it assumes that the time of day is the most important characteristic of a greeting, and that the day is broken down into at most three components.
Neither is true in general:

-   [Danish](https://twitter.com/T_Norin/status/1457975164008898560) breaks the time of day in two six parts: ("morgen"), pre-noon ("formiddag"), noon ("middag"), afternoon ("eftermiddag"), evening ("aften"), and night ("nat").

-   In [Swahili](https://twitter.com/larsplus/status/1457963302982672386), the greeting varies based on the relationship between the people: "Shikamoo" is for young to old, "Hujambo" is for old to young, and "Mambo" is for young to young.

Greetings are particularly challenging to translate because of their great cultural variation; fortunately most messages in R packages won't require such nuance.

### `sprintf()` vs `glue()`

In R, there are two common ways to interpolate variables into a string: `sprintf()` and `glue()`.
There are pros and cons to each:

-   Using `glue()` requires an additional, if lightweight, dependency, but gives the translator more context (assuming you use informative names for local variables), and makes it easy to rearrange interpolated components:

        msgid "{first} {second} {third}"
        msgstr "{third} {first} {second}"

    On the other hand, putting the name of the variable in the translated string means that you can't change it without updating all your translations, and there's a small risk of it also getting translated.

-   `sprintf()` is built into base R, so is always available.
    The downside is that it can be hard to figure out what the sentinels refer to and the syntax for rearranging components (which uses `1$`, `2$`) is somewhat arcane.

        msgid "%s %s %s"
        msgstr "%3$s %1$s %2$s"

The difference may be more important than you realize -- as mentioned above, some languages (e.g., Turkish, Korean, and Japanese) assemble phrases into sentences in a different order. "I have 7 apples" becomes "7りんごをもっています" in Japanese, i.e. "7 apples [I'm] holding" -- the verb & subject switched places. The reordering of templates in your messages is going to be quite common if you want your messages available in more than a very limited set of languages.

### Un-translatable content

You can use interpolation to avoid including un-translatable components like URLs or email addresses into a message.
This is good practice because it saves work for the translators, makes it easier for them to see changes to the text, and avoids the chance of a translator accidentally introducing a typo.
It works something like this:

```{r, results = FALSE}
# Instead of this:
tr_("See <https://r-project.org> to learn more")

# Try this:
url <- "https://r-project.org"
glue(tr_("See <{url}> to learn more"))
```

Similarly, if you're generating strings that include in HTML, avoid including the HTML in the translated string, and instead translate just the words:

```{r, results = FALSE}
# Instead of this:
tr_("<a href='/index.html'>Home page</a>")

# Try this:
paste0("<a href='/index.html'>", tr_("Home page"), "</a>")
```

Generally, you want to help the translator spend as much time as possible helping you out.

### Skipping translation

Another way to help translators is to not ask them to translate things that don't need translation.
There are three ways to suppress translation extraction, all using comments:

*   Add `# notranslate` to the end of a line to prevent any messages on that line from being
    extracted.

    ```r
    # Don't extract this message
    message("A message for the developer") # notranslate
    ```

*   You can also surround a block of code with `# notranslate start` and `# notranslate end`.
    Any messages inside that block will not be extracted.

    ```r
    # notranslate start
    message("A message for the developer")
    message("Another message for the developer")
    # notranslate end
    ```

    This is useful when you have a block of code that generates messages that are not intended for
    the user.

The same comment works in both R and C/C++ code; in the latter case, note that it must be part of a
normal C comment, e.g. `// # notranslate`.

### Googling

It's worth noting that that non-English messages are often harder to Google because few non-English languages have a significant presence on StackOverflow.
A few suggestions:

-   You can give error messages a unique identifier (e.g. numbering). This may be harder to do for "established" packages since adding identifiers might be a breaking change. It could also be a headache to keep track of which numbers have been taken, e.g. in a context of concurrent PRs incrementing the error numbering in parallel.
-   End users can switch to an English locale mid-session by running `Sys.setenv(LANGUAGE = 'en')`: error messages will be produced in English until they set `LANGUAGE` again.
-   You could write a custom error wrapper that produces the error both in English and as a translation.

### Plurals

In English, most nouns have different forms for one item (the singular) or more than one item (the plural, also used for zero items).
Or, more formally, in English the [grammatical count](https://en.wikipedia.org/wiki/Grammatical_number) has two forms: singular and plural.
So you might be tempted to construct a sentence like this:

```{r}
cows <- function(n) {
  if (n == 1) {
    paste0(n, " cow")
  } else {
    paste0(n, " cows")
  }
}
paste("I have ", cows(0))
paste("I have ", cows(1))
paste("I have ", cows(2))
```

But this doesn't always work, even in English:

```{r}
paste0("There are ", cows(0), " in the field")
paste0("There are ", cows(1), " in the field")
```

Again, we always want to construct a complete sentence:

```{r}
field_cows <- function(n) {
  if (n == 1) {
    fmt <- tr_("There is {n} cow in the field")
  } else {
    fmt <- tr_("There are {n} cows in the field")
  }
  glue(fmt)
}
```

But there's an additional wrinkle here: while English has singular and plural, other languages have different forms like singular (1), dual (2), and plural (3 or more), or singular (1), paucal (a few), and plural (many).
So we need to use a different helper: `ngettext(n, singular, plural)`:

```{r}
field_cows <- function(n) {
  glue(ngettext(n,
    "There is {n} cow in the field",
    "There are {n} cows in the field"
  ))
}
```

`ngettext()` generates a special form in the `.po` file, which shows both singular and plural forms:

    msgid "There is {n} cow in the field"
    msgid_plural "There are {n} cows in the field"

Then the translator can supply any number of translations.
Languages that don't have plurals (e.g. Chinese) only need to supply a single translation:

    msgid "There is {n} cow in the field"
    msgid_plural "There are {n} cows in the field"
    msgstr[0] "田裡有{n}頭牛"

Russian has three forms, so it gets three entries (roughly 1, 2-4, and everything else):

    msgid "There is {n} cow in the field"
    msgid_plural "There are {n} cows in the field"
    msgstr[0] "В поле {xn} корова"
    msgstr[1] "В поле {n} коровы"
    msgstr[2] "В поле {n} коров"

Slovenian and Serbian have four forms, Irish has five forms (learn more at [bitesize Irish](https://www.bitesize.irish/blog/counting/)), and Arabic has six forms.
The rules that define which number get which message are quite complex and encoded in the "plural form" that's recorded at the top of the `.po` file and looks something like `(n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20)? 1 : 2).` However, this is something the translators need to worry about not you, and as speakers of the language they should find it easier to puzzle out the rules.

### Collapsed lists

In English we typically lists of items like "a, b, or c", where the use of the serial, or [Oxford](https://en.wikipedia.org/wiki/Serial_comma), comma being a hotly debated style preference.
European languages follow the mostly same form, although none use the Oxford comma, and they obviously translate "or"[^4].
The [and](https://github.com/rossellhayes/and) package takes care of these details:

[^4]: In Spanish and Italian, the word used varies based on the start of the following word.

```{r, eval = FALSE}
library(and)
values <- c("first", "middle", "last")
or(values)
#> [1] "first, middle, and last"

# lang is normally retrieve automatically from the environemtn
# overriding it here to show what a translation looks like:
or(values, lang = "fr")
#> [1] "first, middle ou last"
```

Which also works will in glue:

```{r, eval = FALSE}
glue(tr_("`x` must be one of {and(values)}"))
#> `x` must be one of first, middle and last
```