How to implement strings

The C programming language defines a string as "a contiguous sequence of characters terminated by and including the first null character". Since the character \0 marks the end we often call this zero- or null-termination. In C programs this means a string is either char* or char[n]. Historically, this representation actually predates C and seems to come from PDP-11 assemblers. The primary advantage of this representation is space efficiency. Also, you can do some tricks like splitting a large string into multiple small ones by inserting nulls.

Nevertheless, other programming languages often use different representations. What else is possible?

Pascal Strings

An alternative representation of a similar age are Pascal strings, where we encode the length of the string as a prefix. While the size of the prefix could vary, let us assume word size here. This is enough for any string that fits into your RAM. Assuming a 32bit architecture, this wastes an additional 3 bytes per string compared to C strings: Word size is 4 bytes but we do not need the 1 byte zero terminator. (If you are on a 64bit architecture, you probably do not care about a few bytes here and there, so I'm talking about 32bit) Since most memory allocators will round up the length of strings, there will usually be no difference in memory use. In C we could declare such a Pascal string like this:

struct pstring {
   size_t length;
   char data[];
}

This uses the flexible array member feature introduced by C99. The struct does not contain a pointer. The data is next to the length in memory.

In contrast to C strings, Pascal strings always know their length. Where C strings need to count the chars up to the zero-terminator, the length can be read in O(1) in Pascal. This enables implicit boundary checks. Of course, this costs some runtime. However, this prevents serious bugs like buffer overruns. That it is clearly worth it as a default.

Capacity

In addition to length, we could store an additional attribute "capacity".

struct pcap_string {
   size_t length;
   size_t capacity;
   char data[];
}

While length is where the string ends, capacity is where the allocated memory ends. This means we have the invariant that length <= capacity. Of course, this costs an additional word per string, but the use case where we add more and more data to a string is more efficient. Instead of reallocating memory, we can just increase the length while capacity allows.

Indirection

Another idea is to introduce an indirection.

struct i_string {
   size_t length;
   char *data;
}

The trick we gain here is that multiple string objects with different length values can refer to the same data. That even works if referenced data overlaps. For example, referring to a substring is a common operation. In this case we can adapt length and data accordingly. In the following example i_string is represented by a box. The box on the left refers to the full string "HELLO WORLD!". The second box on the right uses a length of 5 and thus refers to the substring "WORLD".

img/string_indirection.svg

The tricky part is that it only works safely if the data is immutable. If we modify the data we cannot see who else refers to it. Immutable strings are not unreasonable. Java also uses them, for example. Immutability is good in general once concurrency comes into play.

Slices

Now we combine the two tricks of capacity and indirection. We must store the capacity together with the raw memory, while the length belongs to the reference.

There is one more trick we have to use to make the combination of the two techniques feasible. Consider the scenario that we have a substring and append to it. Clearly, we do not want to overwrite parts of the original string. Also, when we refer to a substring, we must not access the capacity, because at the position is just more of the original string. To resolve these problems, a string must have the information if it refers to a raw data or something with capacity. Thus, we add a flag is_raw to distinguish between the two cases.

struct ic_string {
   size_t length;
   union {
     struct chunk *data;
     char *raw;
   }
   bool is_raw;
}
struct chunk {
   size_t capacity;
   char raw[];
}

It is a reasonable optimization to pack the is_raw flag into the length field. We can assume that we will never handle strings which are larger than half the addressable memory. If we do that, then ic_string is only two words.

This combination of capacity and indirection describes well how slices in D work. We can efficiently append to strings due to capacity and we efficiently use substrings due to the indirection. A slice can be mutable in D which leads to unintuitive behavior. However, D defines string as const(char)[], so the data is immutable.

Memory Management

If your language is low-level enough that you care about memory management that should influence the design of strings as well.

Reference counting is great for strings, because there can be no cycles. Of course that comes with overhead. You need an (atomic) reference counter in each chunk. Also, there cannot be raw data references because that would make it impossible to access the reference counter. This implies a data structure like this:

struct rc_string {
   size_t start;
   size_t end;
   struct chunk *data;
}
struct chunk {
   size_t refcount; // update atomically!
   size_t capacity;
   char raw[];
}

Naturally, a garbage collector handles all the burden generically. With manual memory management, you could hide the reference counting inside the data structure.

Even more advanced

There are more advanced data structures for strings. For example, ropes are designed for text editors. The string is represented as a tree to make changes in the middle efficient. Then there are gap buffers, zippers, and more. However, none of them are suitable for general strings where we might have many small ones.

One clever optimization is to store small strings directly instead of allocating and referencing a chunk. On a 64bit architecture a pointer is 8 bytes (characters). With clang std::string has a size of 24 bytes (three words) and strings up to 22 bytes are stored without allocating memory.

Unicode

So far we have only explored the data structure to manage strings. What do strings actually represent though? Is it ASCII text? Arbitrary Unicode? UTF-8?

img/letters.jpg

ASCII text is implicitly small and efficient. In some embedded use cases, you can probably get away with it. Once you provide a user interface that is not good enough today and Unicode is often a requirement. Now there are two options:

  • UTF-8 is space efficient because in the case of ASCII data it is just as small.
  • UTF-32 allows for random access to Unicode code points. Although that argument quickly becomes void when you have to handle graphemes or even grapheme clusters.

In my opinion, UTF-8 is the best default. Whatever you do, do not provide a generic "length" attribute or property. There are two relevant lengths of Unicode strings: memory size in bytes and number of grapheme clusters. Design your API in a way that users must select this explicitly. Counting code points or graphemes is usually not useful.

Once you support Unicode, you want to provide convenience functions. However, Unicode is a deep rabbit hole and I will not even try to provide an overview here. Instead, I will only give you a glimpse of the depth of the rabbit hole: When you uppercase or lowercase letters, it actually depends on your locale. For example, I usually becomes i, unless the text is turkish in which case it should become ı (dotless I). Eevee wrote more about dark corners of Unicode like this

Even if you only care about English text, there’s more than one Latin alphabet in Unicode! Is “x” equivalent to “𝗑” or “𝘅” or “𝘹” or “𝙭” or “𝚡” or “x” or “𝐱”? What about “×” or “х” or “⨯” or “ⅹ”? Ah, sorry, those last four are actually the multiplication sign, a Cyrillic letter, the symbol for cross product, and the Roman numberal for ten.

And when you are done with Unicode, you might also have to handle mixed encodings. Text files and IRC logs may easily contain Latin-1 and UTF-8 mixed together.


Photo by Amador Loureiro on Unsplash.

Thanks to Someone, geocar, and nickpsecurity for their feedback which improved this article.

Related articles: What does High-Quality Font mean?, Faster than C, and Accidentally Turing-Complete.

There are many alternatives to C strings. Here we explore the design space.