A simple example that shows how we count and iterate over a string by rune and byte.
Unicode can be complex, but it isn't rocket science. Here's a short primer on the basics of Unicode and text encoding. For more depth, read the blog post "Strings, bytes, runes and characters in Go" on the official Go website.
Term | Description |
---|---|
String | A read-only slice of arbitrary bytes. A string literal holds UTF-8 text unless it contains byte-level escapes. |
Code point | A numerical value that is mapped to a character. The mapping depends on the character set, e.g. ASCII or Unicode. |
Rune | Go's term for a code point. The rune type is an alias for int32. |
Byte | A unit of digital information made up of 8 bits. |
Character | An abstract representation of a symbol. The term is ambiguous and depends on context, because the same character can be represented in different ways. |
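To tie the terms in the table together, here is a minimal sketch that prints one character alongside its code point and the number of bytes its UTF-8 encoding needs. The helper name `describe` is hypothetical and exists only for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper: it formats a rune as its character,
// its code point (U+XXXX), and the length of its UTF-8 encoding in bytes.
func describe(r rune) string {
	return fmt.Sprintf("%c %U len=%d", r, r, utf8.RuneLen(r))
}

func main() {
	fmt.Println(describe('A'))  // A U+0041 len=1
	fmt.Println(describe('世')) // 世 U+4E16 len=3
}
```

The same rune type holds both values; only the encoded length differs.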
Let's use this string "你好世界" as an example. In Go, we can represent it in the following ways:
// Both strings are equivalent.
str := "你好世界" // As a UTF-8 string literal.
str = "\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c" // As bytes, using byte-level escapes.
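The equivalence of the two forms can be checked directly. The sketch below compares them and dumps the underlying bytes; the helper `sameBytes` is a hypothetical name introduced only for this example.

```go
package main

import "fmt"

// sameBytes is a hypothetical helper. String comparison in Go is
// byte-wise, so == is enough to check that both forms match.
func sameBytes(a, b string) bool {
	return a == b
}

func main() {
	literal := "你好世界"
	escaped := "\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c"

	fmt.Println(sameBytes(literal, escaped)) // true
	fmt.Printf("% x\n", literal)             // e4 bd a0 e5 a5 bd e4 b8 96 e7 95 8c
}
```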
How big is a string? It depends. You can count the size by number of bytes or by number of runes.
fmt.Println(len(str)) // Prints 12
fmt.Println(utf8.RuneCountInString(str)) // Prints 4
See source code for details.
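As a self-contained sketch of the two ways of counting, the program below reports both sizes for a few strings. The helper `sizes` is a hypothetical name; `len` and `utf8.RuneCountInString` are the same calls used above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes is a hypothetical helper that returns the byte length and
// the rune count of s.
func sizes(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	for _, s := range []string{"hello", "h\u00e9llo", "你好世界"} {
		b, r := sizes(s)
		fmt.Printf("%q: %d bytes, %d runes\n", s, b, r)
	}
	// "hello": 5 bytes, 5 runes
	// "héllo": 6 bytes, 5 runes
	// "你好世界": 12 bytes, 4 runes
}
```

For pure ASCII text the two counts agree; they diverge as soon as a rune needs more than one byte.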
To iterate over a string by byte:
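// Option 1: index the string directly; str[i] is a byte.
for i := 0; i < len(str); i++ {
	fmt.Printf("%x ", str[i])
}
// Option 2: range over the string converted to a byte slice.
for _, b := range []byte(str) {
	fmt.Printf("%x ", b)
}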
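Both loops visit the same bytes in the same order. The sketch below wraps each one in a function so the results can be compared; `hexByIndex` and `hexByRange` are hypothetical names used only here.

```go
package main

import (
	"fmt"
	"strings"
)

// hexByIndex formats each byte of s in hex by indexing the string directly.
func hexByIndex(s string) string {
	var sb strings.Builder
	for i := 0; i < len(s); i++ {
		fmt.Fprintf(&sb, "%x ", s[i])
	}
	return sb.String()
}

// hexByRange does the same by ranging over a []byte conversion of s.
func hexByRange(s string) string {
	var sb strings.Builder
	for _, b := range []byte(s) {
		fmt.Fprintf(&sb, "%x ", b)
	}
	return sb.String()
}

func main() {
	str := "你好世界"
	fmt.Println(hexByIndex(str))                    // e4 bd a0 e5 a5 bd e4 b8 96 e7 95 8c
	fmt.Println(hexByIndex(str) == hexByRange(str)) // true
}
```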
To iterate over a string by rune:
// Ranging over a string decodes one rune per iteration.
for i, r := range str {
	fmt.Printf("%q starts at byte position %d\n", r, i)
}
Pay close attention to the index i: it denotes the byte offset at which the rune starts, not the rune's ordinal position in the string.
See source code for details.
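To make the byte-offset behavior concrete, the sketch below collects the index produced by each iteration. The helper `runeOffsets` is a hypothetical name; the `[]rune` conversion at the end is one way to get ordinal indexing when that is what you need.

```go
package main

import "fmt"

// runeOffsets is a hypothetical helper that returns the byte offset
// at which each rune of s starts.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	str := "你好世界"
	fmt.Println(runeOffsets(str)) // [0 3 6 9]

	// Converting to []rune gives rune-based (ordinal) indexing instead.
	runes := []rune(str)
	fmt.Printf("%c\n", runes[1]) // 好
}
```

Each rune in this string takes three bytes, so the offsets advance by three rather than by one.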
Run the program:
$ make run