Go is Weird: Strings

Having done extensive programming in C, I am not particularly spoiled when it comes to the idiosyncrasies of a language’s “string” type. Yet, Go’s string type keeps tripping me up: why does it all still have to be that complicated?

There are two answers to this question:

  • Because strings, in a globalized world, are complicated
  • Because the designers of the Go language made some non-intuitive choices

tl;dr: What is a String?

Generally speaking, to a programmer, a “string” is an array of characters.

But Go’s string type is not a string in this sense. It’s not even a UTF-8 string. Instead, it’s an immutable slice of bytes. That’s right: a Go string is neither a sequence of characters nor a sequence of “runes”, but a sequence of bytes.

The bytes in the byte slice can contain anything: their content and format are not restricted, and are completely arbitrary. In particular, there is no requirement that the bytes be valid UTF-8. Go makes a strict separation between the data structure itself (string) and the interpretation of its contents (we will come back to that below).
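A minimal sketch of this separation (the byte values are arbitrary, chosen precisely because they are not valid UTF-8):

    package main

    import "fmt"

    func main() {
        // Any byte values are allowed; this string is not valid UTF-8.
        s := string([]byte{0xff, 0xfe, 0xfd})
        fmt.Println(len(s))      // 3
        fmt.Printf("%x\n", s[0]) // ff
        // s[0] = 'a'            // compile error: strings are immutable
    }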

In fact, the only real difference between Go’s string and an (immutable) byte slice is that a string can be transformed into a collection of “runes” (essentially, characters) by one of two built-in mechanisms: either explicitly (using runes := []rune(str)) or implicitly, as part of the loop-range construct (for idx, rune := range str {...}). It is in this transformation, and only here, that the encoding of the information contained in the bytes matters, and where Go requires the use of UTF-8.

The primary source of confusion is that the two most commonly used operations on Go sequences (namely len(s) and s[i]), when applied to strings, operate on bytes, not characters or “runes”: len(s) returns the number of bytes (not characters) in the string, and s[i] returns the byte at position i, not the character.

What makes this doubly confusing is that, when the string contains only 7-bit ASCII characters, both len(s) and s[i] seem to do the right thing.
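A quick illustration, using a string that contains one two-byte character:

    package main

    import "fmt"

    func main() {
        str := "héllo" // "é" occupies two bytes in UTF-8
        fmt.Println(len(str))         // 6, not 5
        fmt.Println(str[1])           // 195: the first byte of "é", not a character
        fmt.Println(len([]rune(str))) // 5: the number of runes
    }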

In a way, the worst of all worlds.

What’s a Character — String Storage and Encodings

The behavior of the Go string type makes more sense when one realizes that Go strings make a strict distinction between the data storage and its interpretation.

Obviously, a sequence of bytes, by itself, has no semantics at all: we need some out-of-band information to interpret the bytes appropriately (the bytes might contain a PNG-encoded image, for instance). Even when we know that the sequence of bytes contains textual data, we still need information about the encoding to break the byte sequence into characters. The problem now is that there is not a single encoding — in fact, multiple encodings coexist (UTF-8, UTF-16, UTF-32 are only the most common).

Go’s string data type tries to accommodate all possibilities by separating the data storage from the encoding: the string type handles the storage, but does not enforce a particular choice of encoding.

Go only expects a particular encoding when converting a string to a sequence of characters (or “runes”). Go provides two mechanisms for doing so:

  • explicitly: runes := []rune(str)
  • implicitly in a for-range loop: for idx, rune := range str { ... }

In both of these cases, Go expects the string to be encoded using UTF-8; invalid byte sequences are decoded as the replacement character \uFFFD (which is usually rendered like this: �).
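A short sketch of the second mechanism at work, showing both the byte offsets and the replacement character:

    package main

    import "fmt"

    func main() {
        for idx, r := range "a€b" { // "€" occupies three bytes in UTF-8
            fmt.Printf("%d: %q\n", idx, r) // prints 0: 'a', 1: '€', 4: 'b'
        }
        // Note that idx is a byte offset, not a rune count!

        for _, r := range string([]byte{0xff}) { // not valid UTF-8
            fmt.Printf("%q\n", r) // '�' (that is, \uFFFD)
        }
    }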

There are other ways to perform character-level operations on a string variable, which make the encoding explicit: the packages unicode/utf8 and unicode/utf16 provide functions such as RuneCountInString(string) (but not RuneAt(i)!). Also, note that the top-level package is unicode, not encoding!
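A small example (the string value is illustrative):

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        str := "héllo"
        fmt.Println(utf8.RuneCountInString(str)) // 5

        // There is no RuneAt; instead, decode the rune starting at a byte offset:
        r, size := utf8.DecodeRuneInString(str[1:])
        fmt.Printf("%q occupies %d bytes\n", r, size) // 'é' occupies 2 bytes
    }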

Go’s rune data type, by the way, is simply an alias for int32: a type large enough to hold any Unicode code point (the same width as a UTF-32 code unit). It does not have any other special meaning; you can do arithmetic with runes, if you like. (In the same spirit, byte is simply an alias for uint8.)

There is one other place where Go mandates UTF-8: Go source files themselves must be UTF-8. This has the curious side effect that string literals (such as: str := "Hello, World") are automatically UTF-8 encoded.
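This is easy to verify:

    package main

    import "fmt"

    func main() {
        str := "é"               // the compiler stores this literal as UTF-8
        fmt.Println(len(str))    // 2
        fmt.Printf("% x\n", str) // c3 a9: the UTF-8 encoding of U+00E9
    }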

In a similar way, a Go character (pardon: rune) literal (like 'a') is simply a number of type int32. In other words, the three expressions ' ', '\x20', and 32 are all identically equal! Finally, because rune literals are untyped constants, evaluated at compile time, a 7-bit-clean expression such as 'a' fits into a byte.
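Again, a quick sketch:

    package main

    import "fmt"

    func main() {
        fmt.Println(' ' == '\x20', ' ' == 32) // true true
        fmt.Println('a' + 1)                  // 98: rune arithmetic is just arithmetic
        var b byte = 'a'                      // the untyped constant 'a' fits into a byte
        fmt.Println(b)                        // 97
    }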

How are strings different from byte buffers?

All this raises the question: why do we have a string type at all? Would things not be easier and clearer if everything were handled explicitly as byte buffers ([]byte)?

The differences seem slight. Besides being immutable, string values are also comparable, something that byte slices are not (although the bytes package provides a Compare(a, b []byte) function, as does the strings package!).

And the string type supports conversion to []rune, by one of the two methods described above.
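Comparability matters more than it may seem, because only comparable types can serve as map keys. A brief sketch:

    package main

    import (
        "bytes"
        "fmt"
    )

    func main() {
        a, b := "go", "go"
        fmt.Println(a == b)              // true: strings support == directly
        seen := map[string]bool{a: true} // ...and can be used as map keys
        fmt.Println(seen[b])             // true

        x, y := []byte("go"), []byte("go")
        // x == y // compile error: slices are not comparable
        fmt.Println(bytes.Compare(x, y)) // 0 (equal)
    }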

Why This Way?

There are two questions that naturally arise:

  • Why do Go strings not enforce an encoding (say: UTF-8) at all times?
  • Why does Go not provide methods to operate on individual characters, only on bytes?

I believe the answer to the first question is the desire to be able to read any text, no matter what its encoding is. Unless the program needs to operate on individual characters, it never needs to know the encoding at all: all bulk string operations (trim, split, append, etc.) can be done independently of the specific encoding. Given that, forcing each input string to be converted to (say) UTF-8, and possibly back to its original encoding on output, seems wasteful.
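For instance, splitting on an ASCII delimiter never has to decode the payload: in UTF-8 (and any other ASCII-compatible encoding), no multi-byte sequence contains a byte that looks like an ASCII comma. A sketch:

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        // strings.Split works bytewise, yet handles this UTF-8 payload correctly.
        fields := strings.Split("π,√2,φ", ",")
        fmt.Println(fields) // [π √2 φ]
    }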

The reason that functions operating on individual characters are missing seems to lie in the spirit of the Go language: avoid operations that look simple, but carry invisible costs. Given the variable-length encoding of UTF-8, the only way to find the i-th character in a string is to walk the string from the beginning. Finding two characters requires walking the string twice. At that point, it is more efficient to walk the string only once, namely to break it into runes explicitly (using []rune(str)), and then operate on the slice of runes.
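A sketch of the trade-off:

    package main

    import "fmt"

    func main() {
        str := "héllo"

        // Finding the i-th character means walking the string rune by rune:
        count := 0
        for _, r := range str {
            if count == 2 {
                fmt.Printf("%q\n", r) // 'l'
                break
            }
            count++
        }

        // For repeated access, convert once and index the rune slice:
        runes := []rune(str)
        fmt.Printf("%q %q\n", runes[2], runes[4]) // 'l' 'o'
    }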

Bitching

All that being said, I still find Go’s handling of strings, characters, and encodings confusing and difficult. It all sort of makes sense, but it is not an example of clarity and elegance; it is one of those instances where I get the feeling that the designers of the Go language didn’t really think things through to the end. There has to be a better way.

The separation of storage and encoding makes sense. I am less certain that it makes sense to support a string type with bulk string operations (split, append, etc.), but without an explicit encoding. In my experience, when working with strings, sooner rather than later I need to operate on individual characters as well, so the encoding comes in through the back door pretty quickly anyway! But my experience may be atypical; I don’t know. Finally, having two parallel data structures (namely string and []byte) that are almost, but not entirely, like each other is weird and confusing.

But what I really don’t like is how some critical pieces of information are unnecessarily obscured — unless you are a language lawyer, it is not obvious that []rune(str) requires UTF-8. Should this not have been made explicit (whatever: utf8.StringToRunes(str) or so)? Similarly regarding the for-range loop construct — how is anybody supposed to guess that this operation silently requires UTF-8?

But the prize for the worst design must go to the decision to let the two most basic operations for any collection (namely len() and []), when applied to string, operate on bytes, not runes. That is not how a programmer expects a string type to work. It also seems to get things exactly backwards: I can’t think of a single relevant use case where I would want to know the length of a string in bytes, or access an individual byte of a multi-byte character (pardon: “rune”, of course). I guess this is a consequence of not enforcing an encoding from the outset: without an encoding, there are no “characters” to index, only bytes. (This is how one strange design decision leads to another.)

This is particularly insidious, because it so often seems to work: as long as you stick to 7-bit-clean ASCII, everything behaves as expected. But it will break the moment you encounter “runes” from a wider character set; in other words, Go’s len and [] give you exactly the wrong sense of security. A strange decision.