Commit 3079006

Merge pull request #428 from sshanks-kx/unicode
add clarity and examples to unicode processing
2 parents 79c1d5f + f4b3a50 commit 3079006

1 file changed: +54 -17 lines changed

docs/kb/unicode.md

Lines changed: 54 additions & 17 deletions
@@ -5,31 +5,40 @@ keywords: byte, character, kdb+, q, text, unicode
 ---
 # Unicode

+Unicode text can be stored in symbol, byte list and character list (string) datatypes.

+Since the data is simply a sequence of bytes, any Unicode format can be stored.
+However, it is best to use an encoding such as UTF-8 or GBK that extends 7-bit ASCII, i.e. a single byte in the range `00` to `7f` means the same thing in ASCII.

+The display console should have the matching code page set or you will not be able to view the data correctly.
+For example, if you store in UTF-8 format, ensure that your code page for the display is also UTF-8.
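
The property described here, that a single byte in the range `00` to `7f` keeps its ASCII meaning, can be checked outside q. The following Python sketch is illustrative only and is not part of the changed file; the helper name `extends_ascii` is hypothetical.

```python
def extends_ascii(encoding: str) -> bool:
    """Return True if every 7-bit ASCII character encodes to its own single byte."""
    for code in range(0x80):
        try:
            if chr(code).encode(encoding) != bytes([code]):
                return False
        except UnicodeEncodeError:
            return False
    return True


# UTF-8 and GBK keep ASCII bytes unchanged; little-endian UTF-16 does not.
for name in ("utf-8", "gbk", "utf-16-le"):
    print(name, extends_ascii(name))
```

An encoding for which this check fails would change the bytes of plain q statements, which is why the text recommends ASCII-extending encodings.
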

+## Examples processing Unicode data

-Unicode text can be stored in symbol, byte list and character list (string) datatypes.
-
-Since the data is simply a sequence of bytes, any Unicode format can be stored. However, it is best to use an encoding such as UTF-8 or GBK that extends 7-bit ASCII, i.e. a single byte in the range `00` to `7f` means the same thing in ASCII. kdb+ will load a script with such encoding, but it will not load other formats. Note that if using these encodings, avoid having a byte-order-mark prefix on the data.
+### Storing UTF-8 in a char vector

-The q language itself uses only 7-bit ASCII. For example, the statement `2+3` should be given as the three decimal bytes 50 43 51, as in:
+The two Chinese characters "香蕉" each use 3 bytes in UTF-8.
+In this example, the two Chinese characters are stored in a char vector, which is then shown to hold six 1-byte characters (i.e. 2 x 3 bytes).
+[Comparison](../ref/match.md) with the original UTF-8 characters returns true.
+The contents are displayed as octal escapes, showing the 6 bytes.
+When printed to stdout via [`-1`](../basics/handles.md#file-stdout-stderr), the UTF-8 representation of the characters is shown.

 ```q
-q)`char$50 43 51
-"2+3"
-q)value `char$50 43 51
-5
+q)t:"香蕉"
+q)type t
+10h
+q)count t
+6
+q)t
+"\351\246\231\350\225\211"
+q)t~"香蕉"
+1b
+q)-1 t;
+香蕉
 ```
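
To see where the six bytes and the octal display come from, the same string can be encoded outside q. This is a minimal Python sketch, not part of the changed file; `octal_escapes` is a hypothetical helper that renders every byte as a 3-digit octal escape, which coincides with the q display here because all six bytes are non-ASCII.

```python
def octal_escapes(text: str) -> str:
    """Encode text as UTF-8 and render each byte as a 3-digit octal escape."""
    return "".join("\\%03o" % b for b in text.encode("utf-8"))


data = "香蕉".encode("utf-8")
print(len(data))                        # number of bytes, not characters
print(octal_escapes("香蕉"))             # one escape per byte
print(data.decode("utf-8") == "香蕉")    # the bytes round-trip to the original text
```
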

-Fixed-width Unicode formats cannot be used, since for example, in UTF-16, `2+3` would be the six decimal bytes 50 0 43 0 51 0, and q does not recognize this:

-```q
-q)value `char$50 0 43 0 51 0
-'char
-```
-
-The display console should have the matching code page set or you will not be able to view the data correctly. e.g. if you store in UTF-8 format, ensure that your code page for the display is also UTF-8.
+### Storing data in tables

 Table and column names should be plain ASCII.

@@ -58,14 +67,18 @@ sym name text ..
 bananas 香蕉 "\351\246\231\350\225\211\350\210\271\346\..
 ```

-Display with `-1` to show formatted text:
+Writing to stdout with [`-1`](../basics/handles.md#file-stdout-stderr) shows the formatted text:

 ```q
 q)-1 text 0;
 每日一蘋果, 醫生遠離我
 ```

-Example assignments using the C interface:
+### Using external interfaces
+
+Non-ASCII data can be sent using the various programming interfaces, such as C or Python.
+
+The following example, using the [C interface](../interfaces/capiref.md), connects over TCP and sets two variables, each a char vector representing a UTF-8 string.

 ```c
 int main(){
@@ -76,3 +89,27 @@ int main(){
 }
 ```

+## Using Unicode scripts or statements
+
+kdb+ will load a script in an encoding that extends 7-bit ASCII, such as UTF-8 or GBK, but it will not load other formats. When using these encodings, avoid a byte-order-mark (BOM) prefix on the data.
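
One way to follow the advice about byte-order marks is to inspect a script's leading bytes before handing it to kdb+. This is a minimal Python sketch under the assumption that the script is UTF-8; `strip_utf8_bom` is a hypothetical helper, not a kdb+ facility.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 byte-order mark (EF BB BF), if present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# A script body saved by an editor that prepends a BOM.
script = 't:"香蕉"\n'.encode("utf-8")
with_bom = codecs.BOM_UTF8 + script
print(strip_utf8_bom(with_bom) == script)
print(strip_utf8_bom(script) == script)   # input without a BOM is returned unchanged
```
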
+
+The q language itself uses only 7-bit ASCII.
+For example, the statement `2+3` should be given as the three decimal bytes 50 43 51, as in:
+
+```q
+q)`char$50 43 51
+"2+3"
+```
+Using [`value`](../ref/value.md) to evaluate the statement `2+3` results in 5:
+```q
+q)value `char$50 43 51
+5
+```
+Unicode formats that do not extend 7-bit ASCII cannot be used. For example, in little-endian UTF-16, `2+3` would be the six decimal bytes 50 0 43 0 51 0, and q does not recognize this:
+
+```q
+q)value `char$50 0 43 0 51 0
+'char
+```
+
+
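
The byte values quoted in this section can be reproduced outside q. This is a small Python sketch, not part of the changed file, assuming little-endian UTF-16 without a byte-order mark, which is the layout the six bytes 50 0 43 0 51 0 correspond to.

```python
statement = "2+3"

ascii_bytes = list(statement.encode("ascii"))
utf8_bytes = list(statement.encode("utf-8"))
utf16_bytes = list(statement.encode("utf-16-le"))

print(ascii_bytes)   # the three decimal byte values of the statement
print(utf8_bytes)    # identical to the ASCII bytes
print(utf16_bytes)   # a zero byte follows each ASCII byte
```

The interleaved zero bytes are what q is given in the failing `value` call.
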
