
Right, when I was younger, I was convinced that NUL termination was a reasonable strategy. When I was learning C in the 1990s it made plenty of sense, even though I was also learning about buffer overflows and underflows.

One of the last things that finally changed my mind was the observation that the length shouldn't live with the text, but with the structure describing the text. Some of you might be laughing now, because that was obvious to you, but I genuinely had gone years without considering that. I'd been imagining a hack like the length of the string lives in a few bytes "before" the text.

Once I was envisioning the mutable string as [length, pointer] itself, that seemed obviously better and I was on board with abolishing NUL termination in software.
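To make the [length, pointer] idea concrete, here is a minimal sketch in C. The names `struct str`, `str_from_cstr`, `str_slice` and `str_eq` are made up for illustration and are not from any standard library; the sketch uses a read-only view rather than a mutable string to keep it short.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical [length, pointer] descriptor: the length lives in the
   structure describing the text, not in or before the text itself. */
struct str {
    size_t      len;
    const char *ptr;
};

/* Wrap an existing NUL-terminated string; this is the only place
   that has to scan for the terminator. */
struct str str_from_cstr(const char *s)
{
    struct str v = { strlen(s), s };
    return v;
}

/* Take a substring without copying and without writing a terminator.
   Out-of-range requests are clamped to the bounds of s. */
struct str str_slice(struct str s, size_t start, size_t n)
{
    if (start > s.len) start = s.len;
    if (n > s.len - start) n = s.len - start;
    struct str v = { n, s.ptr + start };
    return v;
}

/* Equality compares the lengths first, then the bytes. */
int str_eq(struct str a, struct str b)
{
    return a.len == b.len && memcmp(a.ptr, b.ptr, a.len) == 0;
}
```

With this layout, `str_slice(str_from_cstr("hello world"), 6, 5)` describes "world" while pointing into the original buffer, which a NUL-terminated string cannot do for anything except a suffix.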



A lot of what C provides is not supposed to be used by application code. The string interface is the bare minimum one should use, but any reasonable application should create its own higher-level interface. The C community's problem is that it never managed to create a reasonable string library, or, if one was created, it is seldom used. Future standards should introduce a higher-level string library to fix this problem.


> I'd been imagining a hack like the length of the string lives in a few bytes "before" the text.

That's normal, usually called a "Pascal string".
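For comparison, a rough sketch of the length-prefixed layout in C, assuming a single length byte in front of the text (so at most 255 characters). The helper names `pstr_set`, `pstr_len` and `pstr_text` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Store src in buf as a length-prefixed string: buf[0] holds the length
   and the text follows, with no terminator.
   Returns 0 on success, -1 if the text does not fit in one length byte
   or in the cap bytes available in buf. */
int pstr_set(unsigned char *buf, size_t cap, const char *src)
{
    size_t n = strlen(src);
    if (n > 255 || n + 1 > cap) return -1;
    buf[0] = (unsigned char)n;
    memcpy(buf + 1, src, n);
    return 0;
}

/* The length is a single read, not a scan. */
size_t pstr_len(const unsigned char *buf) { return buf[0]; }

/* The text starts right after the length byte. */
const unsigned char *pstr_text(const unsigned char *buf) { return buf + 1; }
```

The length still travels with the text here, so a substring needs a copy, unlike a separate [length, pointer] descriptor.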

As I recall, the C standard makes no assumption of whether strings are null-terminated or not.


> As I recall, the C standard makes no assumption of whether strings are null-terminated or not.

I'm not sure what you mean by assumptions made by the C standard, but it definitely says strings are null-terminated:

> A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.

and

> A string literal need not be a string [...], because a null character may be embedded in it by a \0 escape sequence.

(the second one is noting that if a string literal contains "\0", then it's not a string but contains a string with more stuff after it).
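A small illustration of that second quote, with a made-up literal and hypothetical helper names:

```c
#include <stddef.h>
#include <string.h>

/* The literal initializes an 8-byte array: a b c \0 d e f \0 */
static const char lit[] = "abc\0def";

/* Size of the whole array, including both null characters. */
size_t literal_bytes(void) { return sizeof lit; }

/* Length of the string, which ends at the first null character. */
size_t string_length(void) { return strlen(lit); }

/* The "more stuff" after the first terminator is itself a string. */
const char *second_string(void) { return lit + strlen(lit) + 1; }
```

So `sizeof` sees 8 bytes, while `strlen` and every other string function see only "abc".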


You're right. I was remembering something badly. I found some interesting things while looking into this, though:

C's behavior of defining literal strings as null-terminated character arrays is already described in the 1978 K&R. The word "string" is used in the text of the section, but not in its title, "character arrays". Null termination is mentioned here, as the book works through an example of successively reading lines from standard input:

    `getline` puts the character \0 (the 'null character', whose value is zero)
    at the end of the array it is creating, to mark the end of the string of
    characters.  This convention is also used by the C compiler: when a string
    constant like
           "hello\n"
    is written in a C program, the compiler creates an array of characters
    containing the characters of the string, and terminates it with a \0 so that
    functions such as `printf` can detect the end:
    
           | h | e | l | l | o | \n | \0 |
    
    the `%s` format specification in `printf` expects a string represented in this form.

There is a hint, immediately prior to this, as to why null termination might have been chosen:

    The length of the array `s` is not specified in `getline` since it is determined in `main`.

(Where `getline` is a function defined in the example, and `s` is a parameter to that function.)

https://en.wikipedia.org/wiki/Comparison_of_Pascal_and_C#Str... has an intriguing comment that seems likely to be related:

> In Pascal a string literal of length n is compatible with the type `packed array [1..n] of char`.

> Pascal has no support for variable-length arrays, and so any set of routines to perform string operations is dependent on a particular string size.

I suspect I was remembering someone writing that string representation was a live issue when C was being created rather than, as I wrote above, a live issue within C for some period after its creation.

----

On an unrelated note, it's interesting to see that the web convention of fixed-width type for code literals and variable-width type for natural text was already in force in K&R 1978.


Like almost everything in software development, it's a trade-off: length + pointer means that your data structures become bigger and that you will often use more registers. That used to matter more than it does today.
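The size side of that trade-off is easy to see in C. This is only a sketch with a hypothetical `struct str`; the exact numbers depend on the platform (commonly 8 bytes for the bare pointer and 16 for the descriptor on 64-bit targets).

```c
#include <stddef.h>

/* Hypothetical [length, pointer] descriptor. */
struct str {
    size_t      len;
    const char *ptr;
};

/* A NUL-terminated string is referred to by one pointer... */
size_t bare_pointer_size(void) { return sizeof(const char *); }

/* ...while the descriptor carries the pointer plus the length,
   so every field or argument holding one is correspondingly larger. */
size_t descriptor_size(void) { return sizeof(struct str); }
```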


I don’t believe it mattered that much even at the time. Having to calculate the length of the string by iterating over it at each string operation was and is much more wasteful and slow. It is simply a stupid decision.
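A sketch of where that cost shows up, using two hypothetical append helpers. The NUL-terminated version has to walk the destination on every call just to find its end; the counted version already knows.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counted buffer: the length is stored, never rescanned. */
struct counted_buf {
    char  *data;
    size_t len;
    size_t cap;
};

/* Append to a NUL-terminated string held in a buffer of cap bytes.
   strlen(dst) rescans the whole destination each time, so building a
   string from k pieces costs time quadratic in the final length.
   Returns 0 on success, -1 if src does not fit. */
int append_nul(char *dst, size_t cap, const char *src)
{
    size_t dlen = strlen(dst);          /* O(current length) scan */
    size_t slen = strlen(src);
    if (slen >= cap - dlen) return -1;  /* need room for the terminator */
    memcpy(dst + dlen, src, slen + 1);
    return 0;
}

/* Append slen bytes to a counted buffer: no scan, just a bounds check. */
int append_counted(struct counted_buf *b, const char *src, size_t slen)
{
    if (slen > b->cap - b->len) return -1;
    memcpy(b->data + b->len, src, slen);
    b->len += slen;
    return 0;
}
```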


It might sound obvious to you now, but most functional languages conceptually store strings as nil-terminated lists ...


The problem is memory safety, and functional languages won’t read into another object’s memory even with a logical bug.

And a null-terminated linked list is different from a C-string.
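One way to see the difference, as a C analogue of a cons list (the `struct cell` and `list_length` names are made up, and real functional runtimes represent lists differently): the terminator is an out-of-band nil pointer, not a reserved character value, so '\0' is just another element.

```c
#include <stddef.h>

/* One cons cell: a character and a pointer to the rest of the list.
   The end of the list is a NULL tail, not a special character. */
struct cell {
    char               head;
    const struct cell *tail;
};

/* Walk the list until the nil pointer; element values are never inspected. */
size_t list_length(const struct cell *l)
{
    size_t n = 0;
    for (; l != NULL; l = l->tail) n++;
    return n;
}
```

A list holding 'a', '\0', 'b' has length 3 here, whereas the same three bytes in a C string would be read as the one-character string "a".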



