Designing a better strcpy (saagarjha.com)
127 points by signa11 on June 19, 2021 | 120 comments


I hate to nitpick on something so mundane and superficial, but why in the world are people still writing code like this in 2020?

  while (--len && (*dst++ = *src++))
    ;
Dereferenced post-increments are already confusing enough as-is, and yet here we have 1 pre-increment and 2 dereferenced post-increments happening on top of an assignment in a conditional, all in a single expression. Even as someone who does put an assignment in a conditional once in a while, this still feels 100% unjustifiable to me. It's especially ironic given the premise is that C code has security bugs... if the goal is to avoid that, shouldn't there be even more care taken to avoid this kind of code?

EDIT: For those who don't think the readability can be improved... any thoughts on something more like this? Do we really need a compound assignment with three side effects in a conditional modifying the function arguments to make this readable?

  char *strxcpy(char *restrict dst, const char *restrict src, size_t len) {
    size_t i;
    for (i = 0; i < len; ++i) {
      if (src[i] == '\0') { break; }
      dst[i] = src[i];
    }
    int input_exhausted = i < len;
    if (input_exhausted) { ++i; }
    if (i > 0) { dst[i - 1] = '\0'; }
    return input_exhausted ? &dst[i] : NULL;
  }


I can’t answer why other people might do this, but I wrote this code so I can at least give my rationale for it. In short: it leans into C idioms to convey intent. “--len” in a loop condition is a way to say “do this len times”. “*dst++ = *src++” is quite literally the reference strcpy implementation. Taken together, it clearly and quite concisely represents the main operation of this function: do a strcpy, but only up to len characters. There is some extra bookkeeping that needs to be done for errors and null-termination, and that’s delineated from the core part of the code. In all, I read the function as “bail out early if there is nothing to do, otherwise do the copy operation and then fix up the result”.

Now, of course you might say that my code is prone to off-by-one errors and the like. And it totally is! This is C and you are going to get string handling wrong, guaranteed. When I was writing this I personally ran it through the typical edge cases and it caught an instance where I (IIRC) was writing an extra NUL when I had exactly filled the buffer previously. These things happen and you need to verify the code yourself to catch them beforehand. But, back to the point, I don’t trust your code any more than I would mine. I mean, you’re also breaking out of a loop on an additional condition and reusing the index variable for later logic. That’s basically a big red flag for “there might be bugs here”. There’s really no way to avoid it. I would argue that the most complex part of this code is actually verifying that the transition between the loop and the error handling that happens afterwards is correct, not the copying bit. Adding a few extra variables splits up the logic a bit but it also gives you more loop invariants to keep track of. Anyways, that’s my 2¢.


I'm not talking about correctness though. I'm talking about readability. There are so many (classical!) principles of readability (and debuggability) that are all being violated in that one function:

- Not mixing pre and post increments

- Not altering function arguments

- Minimizing (not maximizing!!) side effects in conditionals

- Putting braces around a loop body

- Having fewer entry and exit points (lots of returns especially make debugging hell)

- Just minimizing needless mutation in general

etc.

To give an example, I have such a hard time figuring out what even 'len' is in your code after the loop, and what it implies. I have to think about whether it's signed or unsigned, whether it might wrap around or not (I actually thought you had a bug until I saw the earlier return), whether it ends as zero or something else, and what its relationship to the pointers is. (e.g., whether they all change by exactly the same amount.) Whereas just using a simple index without modifying the arguments just erases all those problems altogether.

It's one thing to feel they're both equally likely to be buggy (I actually don't particularly agree on that either, but I see what you're saying and think your viewpoint is fine there), but do you find them equally readable/understandable/debuggable too? Surely you can agree it's easier to read code that modifies just 1 local variable and has 1 return with just 1 extraneous increment than one that modifies 3 arguments simultaneously, misuses the natural point of a loop, and then adds several returns to handle cases with only minor variations?

Btw, I'm not even particularly concerned about your ability to write code like this. I'm worried about propagating dangerous practices to other people. IMO, for people (or teams) who can write exceptionally correct code even in an unconventional manner, I just keep my mouth shut and let them do their thing as best as they can... internally. But seeing practices that are known to be error-prone for decades still being exhibited and propagated to us mere mortals who are likely to make predictable (human) errors is what bugs me. Especially when the very goal is to teach people to write safe and secure code!


I just want you to know that I find your comments to be interesting; they're really forcing me to find words for why I wrote the function this way! Oh, and ignore 'userbinator; I'm pretty sure I understand their point, but they should know better than to express their views like that.

Anyways, to address your points: I suspect we are reading this function at different "levels". You see it as a mess of mutation and increment operators, but it is written that way on purpose! The reason for this is the "idioms" I mentioned above. You might be aware of some in C already, like

  struct foo f = { 0 };
To someone who is unfamiliar with the syntax this can be confusing: "it's assigning 0 to the struct…what?!" Logically, this zero-initializes the first member of the struct, then implicitly zero-initializes all the other members. But an experienced reader isn't going to read it like that, they're going to read it like "clear out the struct"–they could probably tell you how the C standard makes this work and why this is better than memset, but in their brain when they read it they just know the idiom " = { 0 }" and what it does without going through all that. Similarly, in functional languages you might see code like

   let product = [1, 2, 3, 4].fold(*)
which computes the product of all the numbers. Again, someone can explain how the whole thing works (that * is a binary operator being coerced into a closure argument, or whatever), but at some point you just see ".fold(*)" and think "product". At some point you may even prefer seeing ".fold(*)" over

  var product = 0
  for i in [1, 2, 3, 4] {
      product *= i
  }
because it's concise and doesn't leave room for error (did you spot the bug? I've done this more than once. It's easy to find and fix, but still, hopefully it illustrates how these kinds of things happen.) Now, with that in mind, we have some idioms in my code. These are what they are:

  while ((*dst++ = *src++));
I "know" this idiom as a strcpy. I also "know" that the loop invariant on this is that dst and src always point to the next character to copy: this is true at the start of the loop, while it's running, and at the end except at that point you're done copying so you don't want to run this loop anymore. I don't futz around with the increment operators, I just have internalized this to be true and read it as such. Similarly,

  while (--len);
is just an idiom to run a loop len - 1 times (I made a typo in the first reply, heh). At the end len is zero because it's also the condition. With that done, I think the algorithm I implemented becomes clear:

1. If the buffer has no space, return early. (You can see this is a special case bailout–you'll see why it exists in just a second.)

2. Do a strcpy. But, before we copy any character, we need to also check to make sure you haven't run more than len - 1 times (Why? Again, see below). This check actually has to happen "first" so put it in front of the &&, to make sure we bail before writing a character.

3. Ok, so the loop is over. Since we have two conditions, this means there are two possible cases: either we hit the limit, or we finished copying the string. Actually there are three, since both could happen at once. If len is zero we ran out the buffer, so we need to null-terminate. Because we got rid of that pesky empty buffer case at the beginning, and we've copied len - 1 bytes, we have a guaranteed invariant here: we need to write one last NUL byte, and we have the space for it. Because we used the strcpy idiom, we know that dst has been pointing at the next spot to write to all this time, so we use it now. And src is pointing at the next character we want to write, so it lets us conveniently fill that case where the string fits perfectly, because we can figure this out from the input. The other side of the if is simpler, we simply finished copying without problems, so we can return dst as-is.
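For reference, here is the whole function as those three steps describe it. This is my reconstruction from the descriptions in this thread, matching the stated contract (return a pointer past the NUL on success, NULL on truncation); the blog's actual code may differ in detail:

```c
#include <stddef.h>

char *strxcpy(char *restrict dst, const char *restrict src, size_t len) {
    if (!len)
        return NULL;                     /* step 1: no room for even a NUL */
    while (--len && (*dst++ = *src++));  /* step 2: copy at most len - 1 chars */
    if (!len) {                          /* step 3a: we hit the buffer limit... */
        *dst = '\0';                     /* ...so use the byte we reserved for the NUL */
        return *src ? NULL : dst + 1;    /* truncated, unless the string fit exactly */
    }
    return dst;                          /* step 3b: the NUL was copied; dst is past it */
}
```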

Ok, that was a lot of words, but when I am reading it this stuff just comes automatically. This is why I brought up correctness: readability and correctness go hand-in-hand here. The readability principles you bring up are generally good ideas, but I chose to eschew them here precisely because the idioms give much more powerful invariants, and I think that makes them just more readable. Your function breaks out of the middle of a loop to ensure it doesn't overrun the string; mine just naturally rolls it into the loop condition. Your program increments dst past the end, goes "whoops", then goes back and writes in the NUL byte; mine only moves forwards and ensures that dst points to the place I want to write next from the moment the function starts all the way to the points of return. There is one edge case where I don't write a NUL (which 'saurik also noticed because he also tried to implement this function), but I bail out of that one immediately so the rest of the function can assume that it can write to the end of the buffer.

So, in summary: I know how the increment idioms work. I don't see loops or mutation or side effects, I see the code as "this is the high-level thing that this sequence of characters will do", and then I structure my function around it so that the guarantees I know it can provide apply to my program.


Thanks for fleshing this out. Before I address the larger point, lemme just mention one thing regarding the "Your program increments dst past the end, goes 'whoops', then goes back and writes in the NUL byte" bit: it's not really going "whoops", it's just (a) copying everything except the null terminator, then (b) tacking on the null terminator.

The reason I bring this up is that your characterization actually misses the single most important aspect of that code. Notice that in your own implementation, it's not obvious in the last branch whether the buffer will end up null-terminated or not. To infer that, the reader will have to jump around the code while performing the following mental gymnastics to execute the program backwards: (a) !!len == true in the final branch, (b) therefore --len > 0 earlier, (c) therefore that condition was inactive in the loop, (d) therefore the loop terminated because of the other condition (i.e. because dst had a NUL written into it), (e) therefore dst is NUL-terminated because it is not advanced afterward. I'm not saying people can't do this, but it requires people to focus on the code and concentrate to figure it out. It's essentially a puzzle they need to solve, and that's before maintenance. There's no indication to the maintainer that this logic is implicitly there, so you're really just hoping that every step of this multi-step chain stays intact during maintenance. Contrast this with the index-based solution, where it is blatantly obvious that the returned string is NUL-terminated... the return value is &dst[i] and the very line before it literally tells you dst[i - 1] = '\0' when i > 0! So you can verify the most important property of that function by simple inspection.

So now that I have that out of the way, regarding your larger point: what you're saying is essentially "this is a composition of well-understood primitives", i.e. the primitives are

  while (--len)
and

  while (*dst++ = *src++)
and your brain "parses" them as the high-level operations they are, and then simply reads them as compositions of those high-level operations, rather than as a bunch of pointer manipulation operations.

I guess it explains why you wrote it this way (thanks for that). What I'm saying is that when you force yourself to combine primitives like this, the solution ends up being more complicated than it needs to be.

If you still don't see this the same way, imagine what the logical continuation of what you're saying would be for n >= 2 primitives:

  while (--len && (*dst++ = *src++) && *++ptr && (*foo-- = *bar--) && ...) ;
Surely at this point it's obvious that the whole is more complicated than the sum of its parts, right? Maybe one way to see it is that the composition of these n primitives doesn't just have n termination conditions; it has an exponential number (at least 2^n) of them. That's objectively exponentially worse. The only way you (or your future reader) can dismiss differences in each of those states is by actually spending time analyzing the structure of the problem beforehand; it's by no means a given for an arbitrary problem. And n = 2 isn't exempt from this complexity growth.

Stepping back a bit, and this is a little bit tongue in cheek, but I'm reminded of a joke that seems kind of relevant... it's been years, but if I recall, it went something like this:

A mathematician and an engineer are presented with a problem to solve. Each of them is separately brought into a kitchen, where they see a curtain catching fire. They're asked to put it out. They think for a while, frantically looking around for something useful, and suddenly see there's a pot of water boiling on the stove. They look around, find some mittens, grab the pot, and dump the water on the curtain to put out the fire.

The next day, they're asked to put out a similar fire again, except this time the stove is off. The engineer sees the pot of water, starts looking for a mitten, but then notices the stove is off. So he just picks the pot up with his bare hands and dumps the water on the curtain. The mathematician instead just notices the stove is off, then turns it on to boil the water so he can finish the rest of the problem with the solution he figured out yesterday.

It's exaggerated, but I think a similar idea applies here. Essentially, you don't want to treat everything like a nail just because you have a hammer. Even in a hypothetical world where we could assume the primitives are perfect in isolation and your composition is 100% guaranteed to be bug-free code, this kind of "have idioms -> will use" approach to writing software would merely be guaranteed to give correct code with an upper bound on the required complexity. It would by no means guarantee the final solution is anywhere as simple or readable as it could or should be for that problem, and in this case, IMHO it just isn't.


Oh, that actually makes more sense! To be clear, when I said your code goes "whoops" I didn't mean it was doing anything wrong, it just seemed like you were correcting for a poor choice in bounds, but now I understand that you were actually aiming to copy one fewer character in all cases. So in this case you were actually maintaining the invariant "I need to write a NUL byte" across both cases, which makes sense because you see this part as being the most important attribute of the code, so much so that you interrupt the string copy to ensure it goes through that specific case at the bottom. So I guess the direction I went with this is that once the loop is over the string is either fully copied, or we've filled the buffer and need to write the NUL, and you've gone with "I've copied everything but the last byte, now I should do that (except for that empty buffer case)".

I also want to say that I'm not advocating "forcing" together primitives when they don't work. In this case I had a general idea of which ones might work, then I did a bit of mildly clever rearranging so that the invariants I chose to verify correctness happened to align with the idioms. When you can do this, it's great: you get the benefits for free. But in a lot of cases this isn't going to work, and the way these idioms fail is that you can't just make something new and confusing out of them; you just can't use them anymore. They're like functions from a library: when they work, they're great. But if your needs don't match exactly then you really can't use them at all. So they're really an additional tool in your toolbelt, rather than a general-purpose replacement. If you give me slightly different requirements, I might not be able to use the idioms anymore, and my code might look like yours. That's totally fine, but it just means that I need to go through the whole "does this loop work correctly? What are the invariants?" stuff again because I lost the ability to offload this to the idiom.

Finally, once you actually write programs this way you'll notice that monstrosities with multiple primitives essentially don't get written. The reason for this is that you can only really use multiple idioms if their invariants happen to line up, and this becomes substantially less likely as you combine them, providing a sort of selective pressure towards readability. In this case they did match and you'll notice that I essentially handle no more termination cases than you do. If the invariants line up correctly the termination conditions will collapse together into fewer cases; when they don't you will know immediately because you'll do stuff like write a long if-else chain at the end to "fix" the odd cases or on one branch add an extra loop to "finish the job" because the invariant didn't actually hold in that case.

So, I think I agree that you can make monstrosities if you push the idioms too far, I think they are exceptionally good at "breaking" in that case and you'll quickly realize that you've gone down the wrong path. Just like you can learn to recognize the idioms, you'll "know" when they can apply and you won't force them into places where they shouldn't be used. If done correctly you'll get a concise and "elegant" application without very obvious signs of trying to "undo" what you just did at the end because the idiom didn't match the problem.


> There are so many (classical!) principles of readability (and debuggability) that are all being violated in that one function

In other words, you're just repeating cargo-cult "I don't like how this looks" dogmatism and rejecting it outright, instead of making any attempt at actual understanding.

I strongly recommend that everyone should try an APL-family language, even if only a little bit, to realise that there's a whole world of programming beyond the lowest-common-denominator tripe that's common these days. The mind needs to be exercised.


If len is 1 at the start, won’t this loop not execute one time?

I thought predecrements are executed immediately?


Oops, yes, I forgot to add the "-1" to the end of that. The blog post has the right invariant:

> this function will copy the smaller of strlen(src) or len - 1 bytes from src to dst
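A quick way to convince yourself of the "len - 1" behavior (the helper and its name are mine, not from the post): the body of a `while (--len)` loop runs len - 1 times for len >= 1, and the early bailout exists precisely because --0 on an unsigned type wraps around.

```c
#include <stddef.h>

// Counts how many times the body of `while (--len) ...` executes.
// Only valid for len >= 1; passing 0 would wrap len to SIZE_MAX,
// which is exactly what the early return in strxcpy guards against.
size_t count_decrement_loop(size_t len) {
    size_t iterations = 0;
    while (--len)
        ++iterations;
    return iterations;
}
```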


This looks straightforward to me and probably to many C programmers. It is natural to write this as well as other styles you can come up with. Nitpicking is the bigger problem here, as it distracts attention from the real issue.


I don't know, it could be written more verbosely, but then, this is just so idiomatic. Not sure writing it out makes it much more legible.

ETA: yes, your version reads really well. Question though: does it require an extra multiplication and addition for every "[i]" in the code below, or are compilers smart enough to optimise that these days? It seems to me that for sufficiently dumb compilers, your "nice" version would be slower.

    for (i = 0; i < len; ++i) {
      if (src[i] == '\0') { break; }
      dst[i] = src[i];
    }


You can just try it and see [1], but tl;dr, no, that problem doesn't occur here.

The elements are power-of-2 sized, so at worst there would be a shift instead of a multiplication. On x86 there's 'lea' which could encode the shift-and-add in a single (fast) instruction. (Actually, even 'mov' can encode this. See [1].) And one nice thing about simple linear indexing (with power-of-2-sized elements) is that the compiler could use index registers on x86 (si/di), which can sometimes result in even better code than with pointers.

But note that there isn't even a need for a shift to begin with here (let alone multiplication), because it'd be a no-op... since sizeof(char) == 1. So the whole concern is moot.

(If these were generic iterators in C++ I'd have coded them differently, both due to the reason you're mentioning and also because I shouldn't assume they'd be randomly indexable to begin with.)

[1] https://gcc.godbolt.org/z/dqxj5WzTd
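For the behavioral half of the question (the godbolt link is the authority on codegen), the two styles can be checked against each other directly. Both helper names here are mine, written for illustration:

```c
#include <stddef.h>

// Index-based copy, as in the version quoted above.
static size_t copy_indexed(char *dst, const char *src, size_t len) {
    size_t i;
    for (i = 0; i < len; ++i) {
        if (src[i] == '\0') break;
        dst[i] = src[i];
    }
    return i;  // number of characters copied
}

// Pointer-based copy with the same behavior (written to avoid the
// unsigned wraparound that a `len--` in the condition would invite).
static size_t copy_pointers(char *dst, const char *src, size_t len) {
    const char *start = src;
    while (len && *src) {
        *dst++ = *src++;
        --len;
    }
    return (size_t)(src - start);
}
```

Compilers see through both forms; the index variant involves no multiplication at all here since sizeof(char) == 1.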


I disagree, I think that is just the way idiomatic C code looks and should look. A post-incremented dereferenced pointer, for example, is an idiom; it is so common that at some point you will immediately recognize it for what it is. These idioms keep the code concise, which has a lot of value in itself, since they make it easier to recognize what the code is all about instead of figuring out code that essentially expresses the same intent but does it in some non-idiomatic way. They are like chords in music, standard positions in chess, etc.


> A post incremented dereferenced pointer for example is an idiom, it is so common that at some point you will immediately recognize it for what it is.

That's not my point though. I don't really have much of a problem with

  *dst++ = *src++;
in itself; while it doesn't achieve the platonic ideal of a programming language, I agree it's an idiom that can and should be learned for C. I'm talking about everything beyond that. The fact that that is getting mixed in with so much else in that one-liner. We have a predecrement in conjunction with an assignment in the conditional of an empty loop. Surely that's not a "pattern" at that point? If it were, it would surely be a buggy one without the return preceding it!

Do you find it easy to tell what the preconditions and postconditions of that loop are? Like do you find the fact that a length of zero would error obvious? Do you not feel it's a potential trap for people maintaining it, or trying to reuse the code elsewhere? And if I gave you some combination of src/len/dst, would you find it easy to figure out what the variables would be after that loop and verify that they're correct? I definitely don't.


I read this “idiom” together with --len as lodsb; stosb; cmp; loopnz. But with time it lost its purpose, because if you look at any modern char-star interface implementation, you’ll see pages of multi-arch simd-aware code. And it will never pass a review in a decent team, because it offers neither readability nor performance (the author, with all respect, couldn’t even spell its meaning out correctly N times in a row, which none of us could for some N, give or take one). This idiom is just a dead relic on a venerable C programmer’s home altar.


Homage to K&R? Hard to unlearn that style.

How about reading the STL source code? I know it's not deliberately obfuscated, but sometimes feels that way.


A lot of it is deliberate because they have to avoid name collisions thanks to C++ header files being incredibly stupid


Thank you for mentioning this! As idiomatic as the original loop might be, it still forces someone unfamiliar with C to think about it carefully; on the other hand, this version is trivial to read and understand. Code should always be designed with readability in mind, as we spend much of our time reading and understanding code (if an idiom makes something hard to understand, I would probably consider it a bad idiom and avoid the style).

EDIT: And no, there are also people who write code like you did regardless of the language ;)


Personally, I try to make my code as easy to read as I can. Not only is it faster for others to read, but it's also faster to analyse when you're trying to fix a bug.

At the end of the day, today's compilers are good enough to translate the code into efficient machine code anyway.


I guess this is so that something like this can be asked in interviews to test how lucky the candidate is on that particular day and hour. Otherwise everybody would pass the interviews.


> but why in the world are people still writing code like this in 2020?

To be frank, because some people still value intelligence and understanding.

I have been working with C for decades and that code is basically second-nature to me. It reads very naturally. Perhaps it should be you and all the anti-intellectuals who should instead consider why they find it "confusing", and take some personal responsibility to actually try to learn the language instead of complaining that it's "not readable".

> It's especially ironic given the premise is that C code has security bugs... if the goal is to avoid that, shouldn't there be even more care taken to avoid this kind of code?

On the contrary, mental laziness is a bigger source of bugs than anything else. Making code more verbose does not help, and your version has so many more "moving pieces" that I'd say it's harder to understand.


> In the case where src fits in dst, it will return a pointer past the NUL byte it placed; otherwise it returns NULL to indicate a truncation.

It is amazing to me how personal these preferences are ;P... like, I'd be much happier with an API that always returns the location of the NUL byte on success; and, if the string gets truncated, then it instead returns dst+len (the address of the byte past the end of the buffer). This allows for chained constructions that provide efficient strcat-style semantics with easy error propagation, such as this example which concatenates three strings (which I honestly hope I got right... I'm giving reasonable odds to Saagar telling me I've coded a buffer overflow by accident somewhere ;P):

    char buf[X]; // for any X, even 0!
    char *cur = buf;
    char *end = buf + sizeof(buf);
    cur = strjcpy(cur, str1, end - cur);
    cur = strjcpy(cur, str2, end - cur);
    cur = strjcpy(cur, str3, end - cur);
    if (cur == end) goto fail;


Not seeing anything wrong, but it’s C so who knows ;) Thankfully, the primitive I (and memccpy) provide makes writing your wrapper easy and efficient, as opposed to all the other functions which don’t compose at all. (From my phone) I think this might work?

  char *strjcpy(char *dst, const char *src, size_t len) {
      char *result = strxcpy(dst, src, len);
      // strxcpy returns NULL on truncation, so the end must come from dst, not result
      return result ? result - 1 : dst + len;
  }


FWIW, part of the fun of strjcpy is that it is also much easier to write than strxcpy (and watch as I go further and further out on a limb with sketchy C that is likely wrong, lol... I did test it, at least! ;P):

    char *strjcpy(char *restrict dst, const char *restrict src, size_t len) {
        for (;; ++dst, --len)
            if (!len || !(*dst = *src++))
                return dst;
    }
(edit) Oh, I have an even cuter implementation (which might look "too clever" but actually demonstrates something important about the function)! Essentially, what makes strjcpy so "pure" is that the NULL return is really a "special case" in strxcpy that you're having to "undo" in that wrapper, whereas the semantics of strjcpy--which may sound a bit weird--are mapping directly to what naturally terminates the loop: running out of space on one of the two inputs. This purity then gets taken advantage of by the caller to get such easy call chaining and error propagation, as one of these loops can "continue through" into the next loop without any adaptation logic.

    char *strjcpy(char *restrict dst, const char *restrict src, size_t len) {
        for (; len && (*dst = *src++); ++dst, --len);
        return dst;
    }
(edit) Ok: one difference is that this implementation of strjcpy (this isn't intrinsic to the return value surface I described: just to the gloriously simple versions in this comment; your adapted version, for example, is "fine") doesn't put a NUL byte in case of truncation... though, I personally am not at all sold on doing that: I want to firmly fail the operation, rather than try to "use" the truncated data :/. Adding that special case would still result in strjcpy being simpler than strxcpy (and doesn't break its semantics advantages: just make sure to return the address of dst+len, not that extra NUL), but it isn't quite so amazingly simpler at that point ;P.
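For concreteness, here is one sketch of that special-cased variant: write the NUL on truncation but still return dst + len, with the empty-buffer bailout up front. This is my guess at the shape being described, not code from the thread:

```c
#include <stddef.h>

char *strjcpy(char *restrict dst, const char *restrict src, size_t len) {
    if (!len)
        return dst;                   /* no room at all; dst == dst + len, so a
                                         previous truncation propagates through */
    for (; --len && (*dst = *src++); ++dst);
    if (len)
        return dst;                   /* copied the NUL; dst points at it */
    *dst = '\0';                      /* out of space: terminate anyway */
    return *src ? dst + 1 : dst;      /* truncated: dst + len; exact fit: the NUL */
}
```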


> doesn't put a NUL byte in case of truncation

which was an explicit design goal for the author of the article.


Which is the reason I wrote that paragraph... right? ;P I could repeat the content of that paragraph--which explicitly states... no no no: wait, you're tricking me!--but instead I will use this opportunity (as I had run out of edit time on my comment a while back) to connect it to the original thought of how personal some of these thoughts are and then to provide the trivial patch that adds this functionality to my "elegant" implementation of strjcpy without making it as complicated as strxcpy:

[redacted, as actually, omg... I immediately noticed it didn't work! Saagar's implementation of my function does have that property, of course. Fixing it for real requires two statements, one before the loop and the other after it, which is demoralizing, as strxcpy arguably asked for semantics which were impossible to do well for 0-sized buffers because of this termination step]

If you make the loop more complicated (more like my first copy) you can avoid the duplicate comparison, but I continue to find the elegant second copy much more elucidating, as it demonstrates viscerally how the termination is a duplication of effort when chaining: you are effectively concatenating the loops but then tacking what should be a single termination step at the end onto every intermediate loop. Really, you should just not... ooo: almost repeated that previous paragraph again ;P.


Did you mean to check for (cur > end)?


No: if the function succeeds, the address of the NUL byte will always be !=end (as it must be inside of the buffer, which end is not); whereas, in the case of string truncation, cur will be equal to end, as the function returns dst+len (which is end); and, if that error had "already" happened during a previous call, it will get propagated through the next call as end-cur will be 0, causing the next call to immediately fail (even for a 0-length string, which is an important corner-case) and return cur+0 (which is still end).
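Putting that together with the short strjcpy from the earlier comment, the == end propagation can be traced through each case. The concat3 wrapper is mine, for illustration:

```c
#include <stddef.h>

// saurik's strjcpy from upthread: on success returns the address of the NUL
// it wrote; on truncation returns dst + len (and writes no NUL).
static char *strjcpy(char *restrict dst, const char *restrict src, size_t len) {
    for (; len && (*dst = *src++); ++dst, --len);
    return dst;
}

// Chain three copies. Any truncation parks `cur` on `end`; every later call
// then gets len == 0 and immediately returns cur + 0 == end again, so the
// error propagates with a single check at the very end.
static int concat3(char *buf, size_t size,
                   const char *a, const char *b, const char *c) {
    char *cur = buf, *end = buf + size;
    cur = strjcpy(cur, a, (size_t)(end - cur));
    cur = strjcpy(cur, b, (size_t)(end - cur));
    cur = strjcpy(cur, c, (size_t)(end - cur));
    return cur != end;  /* nonzero on success (result is NUL-terminated) */
}
```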


An even better solution:

https://www.digitalmars.com/articles/C-biggest-mistake.html

With the lengths of strings known, copying becomes fast, safe, and trivial.


It always boggles me why people went with null-terminated strings in C instead of just putting the length of the string in the first 4 bytes and then the string after that.

And just generally doing that for all arrays. That way if you're passed ONLY a pointer to the array, you can always tell how long it is without requiring the caller to tell you how long it is, which seems to be a common theme among C functions. And that also allows your string to contain null characters, which is useful in many circumstances.
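A minimal sketch of that layout (a length prefix stored in front of the bytes); every name here is made up for illustration:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// A hypothetical length-prefixed string: a 32-bit length followed by the
// bytes. No terminator needed, so embedded NULs are perfectly legal.
typedef struct {
    uint32_t len;
    char data[];  /* C99 flexible array member */
} lstr;

static lstr *lstr_new(const char *bytes, uint32_t len) {
    lstr *s = malloc(sizeof(lstr) + len);
    if (!s) return NULL;
    s->len = len;
    memcpy(s->data, bytes, len);
    return s;
}
```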


Embedding the length means you cannot slice strings. To get a substring one has to reallocate.

Of course, null termination also means that slicing is restricted to being a tail.


ideally the length should go with the pointer, not with the string.


I feel like to an extent string_view solves that, but in C++ at least you end up having to juggle lifetimes under the hood.


Because this is how strings in C have always been done and this is how the language and pretty much all libraries treat strings.


Not everyone is going to be happy about 8 bytes overhead for every buffer.

Also, not every C-style API will migrate. If you consume one of those, you may still have to count the length, which costs cycles.


> Not everyone is going to be happy about 8 bytes overhead for every buffer.

Then prefix it with four bytes (which, if you omit the trailing NUL, adds three bytes in total). If your C strings are longer than 2^32 bytes, you're probably doing something wrong.

I also should point out that memory is typically cheap. Cache is more precious, and I'd expect on balance adding a length would pollute the cache less than scanning through the whole string unnecessarily.

But I agree with your second point:

> Also, not every C-style API will migrate. If you consume one of those, you may still have to count the length, which costs cycles.

Similarly, the article started with this text:

> Like them or not, null-terminated strings are essential to C, and working with them is necessary in all but the most trivial programs.

You can do things nicely in your own code, but you still need NUL-terminated strings when dealing with existing code. And dealing with existing code is often the reason to pick C...


I would assume the people bothered by an increase in byte-sized overhead are not the ones that are using 2^32 length strings.

About the cache, I think for small strings, scanning the whole string may still be faster.

Though adding a four-byte length prefix would make certain functions like strlen almost go away.


Four bytes are still overhead. Also, you often need to keep the capacity of a string. Another four bytes.

The C string APIs are like the lowest denominator of all use cases. When you know string lengths, you can avoid quadratic strcat() or other performance pitfalls with the existing APIs. On the other hand, if the APIs forced everyone to keep string lengths, performance/memory/convenience would be affected for certain use cases. Memory is not cheap when you deal with many short strings or when you do embedded programming.


> Four bytes are still overhead.

I hedged with "typically" but I'm a little skeptical about the environments in which this is a problem. Embedded stuff that's severely memory constrained probably doesn't deal with that many strings.

But you could use a variable-sized length prefix as another commenter suggested to have no overhead on short strings. Its size would be decided at allocation time so you're not shifting the string. You could indicate it via high bits of the length or low bits of 4-byte-aligned string pointers. It's an extra branch or table lookup or the like on access though.

> Also, you often need to keep the capacity of a string. Another four bytes

Nonsense. There are exactly no situations where you need to keep a capacity with this scheme and not with traditional C strings.


For two standard formats in our field, you sometimes need to keep billions of short strings in memory. Many of these strings are 2 bytes or several bytes in length. The parser in C packs them as NULL terminated strings. The C++ and Java parsers use the std::string or the string type. They use a lot more memory. Even a 4-byte overhead would double the memory. Yes, you can have a smart way to keep short strings to waste less or even no memory (one of the two formats is using this trick), but it makes code harder to read and hurts performance due to the extra check/conversion.

As to the capacity field, we sometimes have to keep it around. The question is: do you keep it inside the string struct/memory block (like std::string) or let users handle it? We have varying preferences in different cases. It is hard to design APIs optimal in all cases.


Do you mind sharing the field / formats? I'm curious.

I think situations like this are special enough that they shouldn't drive discussion for general-purpose types. fwiw, some ideas for what you're describing based on what understanding I can have from your two-paragraph description:

* If many of billions of text strings are that short, there must be a huge number of duplicates. If they have similar lifetimes (e.g., all freed with the parsed file handle) and are immutable or mutations are rare and can follow a "get_mut()", I'd consider aliasing them with an intern table. (You could do that for all strings or for just the strings under a certain length.) There's some CPU and memory overhead from the hash table but it'd make duplicates use no RAM in a way that's orthogonal to the representation of a particular string.

* Or you could have an 8-byte type that can represent a pointer or a 7-byte text string. With a general-purpose allocator, the low bits of a pointer are guaranteed to be zero, and many of the high bits are also effectively unused (all the same) depending on platform.

(Either of the above might be what you're referring to with "a smart way to keep short strings to waste less or even no memory".)

* On most platforms, malloc returns pointers with a minimum alignment of two words or so and thus a similar minimum allocation size. If you're on a 64-bit platform (must be, with billions of 2+ byte strings in RAM), that's 16 bytes. A lot of waste. A custom allocator might be better. It doesn't have to be anything complex. Again depending on lifetime and mutability, you could use a simple arena (bump allocator). Its handle could be indexes that are valid across reallocation or they could be stable pointers by just making a single upper bound allocation and depending on the OS's lazy minor page faulting to keep the unused part from being backed by real physical memory.

* C++: if you pass these to interfaces that take an absl::string_view or the like rather than const std::string&, you have a lot more flexibility in your representation without copying/reallocating on use.

* Java: GCed languages double memory requirements to begin with for the GC to work efficiently, and Java historically uses a UTF-16 string type (maybe there's an optimization in latest versions?), so it's almost hopeless. Hard for me to reconcile "we need to save RAM" with Java at all. The most RAM-efficient way to store them would be outside Java's heap (native code or mmaped region) but then you have to deal with non-idiomatic handles and likely copying/allocating (generating garbage) on use, yuck.

> As to the capacity field, we sometimes have to keep it around. The question is: do you keep it inside the string struct/memory block (like std::string) or let users handle it? We have varying preferences in different cases. It is hard to design APIs optimal in all cases.

Keep it around long-term, as in you have lots of mutable strings at once rather than having them for a short time then "freezing" them into a more efficient immutable type that can assume length=capacity, can use interning and arenas, etc? Then you want to make sure you're not paying 8 bytes for that capacity field, as you can easily end up with having the caller keep it due to padding. (That'd be a more error-prone API too.) You could have your handles be a single word that uses unused pointer bits for size of capacity and length fields and points to a capacity+length+data allocation. That's likely what I would do if efficiency really matters.


You could also prefix it with a varint: in this case it would not require any extra byte compared to the NUL-terminated string when the length of the string is less than 128 chars.


> I also should point out that memory is typically cheap.

Not in the early 70s, no.

Even in the late 80s I was writing code very conscious of every single byte that could be saved anywhere.


Saving 8 bytes, and saving a few cycles by scanning rather than counting... is an argument for using NUL-terminated strings as an optimization in specialized circumstances.

In the common case, the extra 8 bytes and the cycles saved (if any) don't matter — and are not worth the sacrifice of making it harder to program correctly.


> the extra 8 bytes and the cycles saved (if any) don't matter

You wouldn't last in embedded development with that sort of attitude. When you're nearly out of RAM you can't afford to waste overhead of fat strings.


I was using Turbo Basic, Turbo Pascal, Turbo C++ (with C++ collections), and Clipper on MS-DOS, with 640 KB of memory.

It was just fine, then there are those that nowadays even toy with C++17 on C64.

“Rich Code for Tiny Computers: A Simple Commodore 64 Game in C++17”

https://www.youtube.com/watch?v=zBkNBP00wJE

Really, unless we are talking about PIC and AVRs with like 4KB, we are optimizing for the wrong target.


You can't afford to waste cycles on O(n) strlen either.


Not every embedded system is memory constrained these days.


Indeed. And besides, didn't I account for precisely that scenario? Memory-constrained embedded development is a specialized situation that could call for using NUL-terminated strings.


Actually, it's 7 bytes (you can remove `\0` at the end), and that's if you insist on being able to exceed 4GB with your strings.

Personally, depending on the use case, I'd be willing to have strings limited to 256 bytes (no overhead), 64K bytes (1 byte of overhead), and 4G bytes (3 bytes of overhead). Though combining them all might be quite the nightmare…


you can also have something akin to MIDI encoding with higher bit reserved to mark that next byte is a part of length field, too. up to 128 bytes with no overhead, up to 16kb with one byte overhead, and so on. Such a detail would be easy to hide in the library. But I wouldn't be surprised that alignment and other troubles would eat any possible gain.


> But I wouldn't be surprised that alignment ... troubles would eat any possible gain.

It's a string, so alignment shouldn't be a problem. Rather, alignment issues apply to the pointers to that string.


Won't work with UTF-8.


Works with any encoding and arbitrary binary data. The last byte of the length field doesn't have the high bit set, and whatever follows after that is completely irrelevant.

But still needless complexity for this.


Sure it is, the length is out of band. In fact, you could even use utf-8 to encode the length!


Don't forget supporting small strings (<= 7 bytes) in the pointer itself using tag bits. No allocation or dereferencing required.


Microsoft's MFC used to do this in their first version, but later ditched it. Maybe they found it not worth the trouble of checking for this condition everywhere, plus the increase in code size.


These days, in C++, this is routinely done in various std::string implementations (which are both null-terminated and carry a separate length). In fact, the standard was updated to explicitly allow this.


The proposal does not take away any C features.


If someone implemented that, you might even call it betterC...


Maybe as a joke they could call it D since that's the letter after C...


D as in D language... https://dlang.org/


Rant ahead.

(which value to return)

This always bugged me in C interfaces. All these metrics are accumulated in the scope of your qweasd(), why not at least take a pointer to a “struct strcpy_result” instead and return them all with it, dammit, stack space is almost always free. After two decades of C, I turned to jitted scripting and hopefully will NEVER look back (but thanks for the xp). Low-level bit-fiddle style programming outside of very restrictive embedded tech is a distilled schizophrenia of tiling sharp irregular shapes, trying, failing, vapid engineering, byte economy, and finally losing a larger part of the ephemeral benefits that you never needed (or were able to get) in the first place. (C++ being a drug that doesn’t fix the issue, but at least reduces voices in your head to “just” a heavy form of OCD).

discussion on losing 4/8 bytes per string in this thread

Insanity.


> strlcpy... it’s not standard so it doesn’t satisfy 5 either.

It's a de facto standard supported on *BSD/macOS/Android/Linux(w/ musl libc) and a number of other commercial systems, it's also on track for standardization in POSIX (Issue 8).

https://www.austingroupbugs.net/view.php?id=986#c5050

https://www.opengroup.org/austin/docs/austin_1110.pdf

While it might not meet your arbitrary set of requirements, it already has decent adoption.


Yikes, I hope they don’t add it, that would make people want to use it even more. Given its surprising performance characteristics it’s usually not what you want.


It's looking very much like it will, but even if that wasn't the case, it's already widely in use and available in OS libcs and easy to copy, and it has been since OpenBSD first introduced it over 20 years ago. I also believe your performance claims are exaggerated, and strlcpy does exactly what people want and expect in most scenarios.


I've seen a lot of strlcpys, and of that I think I have seen maybe one place where the return value was used. I think if you asked people why they used strlcpy, 95+% would say "security" and then list out the characteristics my strxcpy has, rather than "I want snprintf-like behavior with a size_t return" which is what strlcpy is. I would not be surprised if most people got the time complexity of the function incorrect because of this.


If the return value of strlcpy isn't used, the compiler can optimize the extra inefficiency away. (I don't know if any do, though.)

This is similar to how strlen(x)>0 calls get optimized to *x by clang and likely other compilers.


The concern with performance seems a little overwrought. strlcpy is only slow in the bad case where it truncates, which is ideally not the common case. I've never heard or seen of a performance bottleneck traced to a strlcpy in the hot path.

If you really cared about performance, you'd be using nothing but memcpy with careful length tracking. Regardless of algorithmic runtime, any function that examines bytes as it copies will be slower than a length based copy.


I think one use for this kind of thing is if you are doing some sort of logging you may have a fixed size buffer, both to keep overhead down (you don't want to allocate extra memory) and also prevent overly verbose logs from spamming output. In this case, waiting to calculate the length of a 10 MB string just so you can fit it in a 1 KB buffer is unacceptable. For your second point: not necessarily! memcpy would be slightly faster if you are keeping the length around, but if you're dealing with arbitrary strings you wouldn't have that. Calculating the full length beforehand is just a no-go as I mentioned above, and using memchr first to get the length and then memcpy is not going to be faster.


Example where etcd’s logger was calculating log size and suffering from a performance issue as a result: https://github.com/etcd-io/etcd/issues/12835


Use snprintf with a precision specifier.


Migrating to a pointer-size-pair string representation would be a better use of one's time.


And only works if you don't rely on any third-party libraries that take normal C strings.

Which is to say, is unrealistic for many programs.


Depends on how much they value security; even std::string keeps an extra null for c_str() calls.


Send the pointer from the pair to 3rd party and not the pair itself. You lose the efficiency in 3rd party, but still retain your gain in your own code. Better than before.


You can use a library that also adds a trailing \0. I used it to interface with C++.


I swear part of the problem is with C there is a cargo cult prohibition against passing small structs by value.


The cargo cult goes beyond that.

The belief that C was created alongside UNIX from the start, when it was actually only used to port UNIX V4 to a high-level language.

Micro-optimizing each line of code as it is written, "because it is fast", without even bothering to use a profiler.

Even though lint was created alongside C in 1979 to catch already-known programmer faults with the language, the belief persists that only bad programmers need that kind of tooling.


> Micro-optimizing each line of code as it is written, "because it is fast", without even bothering to use a profiler.

With the CPU, MMU and OS architectures of that period, it wasn't particularly hard to infer what was fast without profiling it. The slow rise in complexity at all 3 levels now makes it hard for even extremely experienced close-to-the-metal programmers to understand what will be fast or slow without a profiler. Times do change, in fact.


Yeah, except the world has moved on from the PDP-11; no need to keep asserting something that is no longer true.


No, the problem on unix v4 (at least I think that's the version I'm talking about) was that the C compiler did not support passing structs - whether as arguments, return values, or any other expression. So, they didn't do that, because it wouldn't work, and it was less hassle to work around it than fix the compiler. The cargo cult is when people keep avoiding struct passing, even though that compiler deficiency has been fixed for decades now.


That isn't what I was talking about, rather that C only appeared on season 4 of the UNIX movie, when many think it was part of the original cast from the get go.

So it gets a cargo cult status, like UNIX was only possible because C was designed to make it happen, and other hocus pocus that ignores almost 15 years of previous work in high-level languages for systems programming.


> rather that C only appeared on season 4 of the UNIX movie, when many think it was part of the original cast from the get go.

Ah, that makes sense, but that's a different cargo cult than Gibbon1 was complaining about.


Hence why I started my reply to Gibbon1 with "The cargo cult goes beyond that.".


In other languages we create types for just about anything, and like you say, it's strange that we don't do this more in C.


Big problem with C is the standards committee just flat out refuses to add an array type to C. It's deranged because if you had first class arrays it'd be a lot easier to generate code that takes advantage of SIMD instructions.


You mean “add a second array type?”

C has an array type, it’s just a bit wacky, and it doesn't play by the same rules as other types.


Yeah because requiring &array[0] instead of array is so much typing, UNIX V4 would never have been finished.


You want to break existing code? If you made that change, you wouldn’t even be able to pass string constants into functions, because string constants are arrays (and not pointers, as some think).

The C standard committee has a general policy of not breaking code. If you want something like C with a better type system and less of the broken stuff (hello, arithmetic conversions) you can try Zig or one of the others. Once you’re okay with breaking code, you call it a different language, and go in trying to fix lots of things.

Unix V4 predates C standardization anyway.


Who mentioned anything about breaking existing code?

I already decided in 1992 that I don't want to use a broken language, unless obliged by university or work requirements.

Unfortunately, I have to use software written by people who don't share that opinion, and apparently WG14 also has a general policy that improving C safety doesn't matter.


> Who mentioned anything about breaking existing code?

You did, when you proposed writing &array[0] to get the address to the first element of an array.

Or maybe I misunderstand what you wrote, in which case you could clarify.

> Unfortunely I have to use software written by people that don't share that opinion and apparently WG14 also has a general policy that improving C safety doesn't matter.

What would WG14 be doing differently if they didn’t have this “policy”? Isn’t the obvious explanation that their top priority is to maintain compatibility with existing code? Is this explanation not satisfactory?

It seems absurd—in the extreme—to expect a standards committee to break large swaths of existing code to improve safety, in a language with as much legacy code as C. I would expect that if the standards committee chose to do that, compilers wouldn’t implement it and users wouldn’t use it. If you are going to break existing code, why not use a different language?


> You did, when you proposed writing &array[0] to get the address to the first element of an array.

That would only happen in two cases:

1 - If C arrays had been properly designed in first place,

2 - Since it is too late for 1, unless we happen to have a DeLorean parked around the corner, by introducing a new datatype for new code, just like what happened with bool and complex in C99.

This is basic compiler design stuff that surely the WG14 heads are capable of.

If not, the Microsoft team from Checked C project can gladly explain to them how to extend C with new array types, while keeping compatibility with existing code.

I really don't get your "maintain compatibility with existing code" hot discussion.


> This is basic compiler design stuff that surely the WG14 heads are capable of.

What I don’t understand is why WG14, specifically, should be doing this. In my mind it seems obvious that WG14 just shouldn’t care about fixing deep issues with C—issues that would require high amounts of engineering effort to adopt. A new array type certainly falls in this category. C99 introduced VLAs—not even really a new array type—and these were so poorly adopted that VLAs were made optional in the next revision. You might also consider Annex K, which is also quite poorly adopted.

If you want to use Microsoft’s checked C, you can use it. If you want to use MISRA, you can use that. If you want to use formal methods, there are various subsets of C that you can tackle with formal methods. If you want to use asan, tsan, valgrind, or switch to Zig or Java or Rust, all of those options are available.

The problems with adding a new array type to the language are that:

1. If WG14 makes a change like you suggest, it will be poorly adopted, meaning that it's a poor fit for the standard, (just based on the history of the adoption of things like VLAs and Annex K, this is the most likely outcome).

2. You can just use a different language for new projects, or extensions / tooling for existing projects.

In other words, it’s a low value change with a high cost.

> I really don't get your "maintain compatibility with existing code" hot discussion.

It’s because your original comment was so short and vague. Feel free to ignore the responses, since you’ve since clarified.


WG14 is responsible for defining what C compilers are supposed to compile as the language standard, available everywhere that C is supposed to be available, and many times the only option available.

Your suggestion to use something else is a bit hard on platforms where C is the name of the game.

I am mostly C-clean since 1992, unless required otherwise; not everyone is free to choose their tooling, nor the language that the software they use, e.g. UNIX kernels or IoT junk, was written in.


I think zig just blew a big gaping hole in the argument that the language needs to be tied to legacy code.

Because zig can compile old C code and Zig into the same binary. Which means there is no reason you couldn't compile old legacy C code and new better C code into the same binary as well.

The C standards committee need to be fired and replaced with people who aren't willfully holding the language back.


Hm, I can understand where you’re coming from but I think you’re asking for something which provides negligible value and would require a significant amount of engineering effort. Additionally, the way in which you’re expressing your opinions about the C language is needlessly inflammatory. What’s most unreasonable about your comment is the way it attacks people.

It seems like your issue here is really nothing more than the names of languages—your position is that there should be a language with clean semantics, interoperability with C, etc. and this language should furthermore be called “C”. A language exists—Zig—which seems to meet all your requirements except one. The only missing requirement, as far as I can tell, is that it is not called “C”. Let me know if there is something I misunderstand about your position, because I can’t think of a different way to interpret it.

I can see where you’re coming from—but to be honest, unless I am misunderstanding your position, your position seems completely unreasonable, and again, the personal attacks are pointlessly inflammatory.


It's not a cult; it's just that the cases where the risks of passing by value would be worth any perceived advantage are so few that it just doesn't make sense to even consider it.

It's not like it's a flimsy, tribal claim; the guidance is solid.


It does suck that the usual 64-bit calling convention limits the size of structs passed by value to 64 bits :/.


At least linux AMD64 ABI allows passing (and returning) 128 bit structures by registers, which is perfect for the ptr/len case.


You could store two 32-bit fields, treat the first as a length, and cast the second to a pointer when needed.


Are there any recommended libraries for doing this, if I'd like to migrate my C codebase?


You might look at SDS.

https://github.com/antirez/sds


Referring to strcpy:

> we can only use it if we know our destination buffer is smaller than our source buffer

Should be the other way around, surely?


LOL yes, thanks for catching that. I’ll fix it when I’m at my computer :)

Edit: edited. I also had a messaging from past me that I had forgotten about:

  <!-- I made a mistake somewhere, didn't I? -->


Absolutely,

That's not the only discrepancy.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/m...

Note the following: "The memccpy() function does not check for the overflow of the receiving memory area." "If copying takes place between objects that overlap, the behavior is undefined."

The strxcpy he provides at the bottom doesn't look better at all. I'm not sure where the author got that function. I found some better variants of the proposed strxcpy function with bounds checks that provide overflow detection.


I wrote strxcpy as an example of a function that satisfies the requirements that I posted at the top. In doing so, it becomes a useful routine for other kinds of needs, such as this one: https://news.ycombinator.com/item?id=27564004

The reason I omitted the specific verbiage you quoted is that they apply equally to all the functions. My function has an important bounds check, but (like all of C) trusts that the parameters you provide to it are correct. There is no additional checking done because doing so in C is not tenable. If implemented in a standard library it may be useful to add additional heuristics to detect invalid cases, but they are fundamentally best-effort and not worth showing in an example.


> While string truncation has its own issues, it is often a fairly reasonable fallback.

Seemingly swimming against the tide here: a strcpy() that _automatically_ truncates strings is worse, and a huge hidden risk.

Yes, buffer overflows from user data are very, very bad.

But a buffer overflow has a clear fingerprint as leaning on undefined behaviour; tools exist to detect these (eg. valgrind), and the fix is well understood.

Defining strcpy to include truncation is the wrong choice, as it's plain dangerous in most cases. It reclassifies the bug as valid program behaviour, where it's riskier and its fingerprint is harder to detect.

For example, a function which looks up credentials for a login name limited to 16 characters. During lookup, a buffer truncates and by default the code is now doing lookups against the wrong login name.

The better solution is a strcpy() which accepts a length and is _expressly not defined_ for cases where the output buffer would overflow; ie. asserts or aborts (which now has scope to be detected at compile time for some cases)

It's now clear where bug is, and a developer must preempt overflow for cases where it's possible and handle it.

Perhaps folks have different experiences, but mine is that copying a string and _wanting_ to truncate it is so incredibly rare and is the cases that should be explicit, not implicit.


Failing perceptibly composes better than failing loudly. It's easy to go one way:

  char *strxxxcpy(char *restrict dst, const char *restrict src, size_t len) {
      char *result = strxcpy(dst, src, len);
      assert(result);
      return result;
  }
but going in reverse is difficult.


But that misses my point; the bug is that as a developer it's too easy to not write the correct check -- like exactly this one.

The library function presented to most developers (especially a novice one) should reflect the most common case, not the more-easy-to-compose case.


I would use ssize_t instead of size_t.

Mixing unsigned and signed is seriously broken in C, and hence better to stick with signed.


> Mixing unsigned and signed is seriously broken in C, and hence better to stick with signed.

Mmmm... maybe in this specific case (didn't look). But if you meant this as general advice, then one should keep in mind that unsigned overflow is specified but signed overflow is UB (barring maybe the very latest version of the C standard); because of that, unsigned division in many cases is trivially optimised to less complex ops, etc.

There are many advantages in using unsigned.


Would it make sense to extend the C standard to have a sensible string type?

/me ducks


It would, just like a sensible array type and proper enumerations as well.

Obviously WG14 doesn't care about it.


I don't want to use any of the str*cpy functions, all of them are either braindead or missing in most libcs. At this point I'm all in on snprintf(%s, foo).


I hate to do this, but: read the post. I mention snprintf specifically: https://saagarjha.com/blog/2020/04/12/designing-a-better-str.... It is neither efficient nor general-purpose as a string copying routine.


Something feels wrong about using memccpy to copy strings... ever since I saw bugs where people used memcpy incorrectly.

And is there a wchar_t version of memccpy?


memccpy≠memcpy, even though their names are similar. While the function is somewhat generic, I would bet 90+% of its use is going to be to copy null-terminated strings.


Right, maybe I'm just reacting to the name. I've had to educate many novices about using memcpy with incorrect lengths... either 1 short (no null), or even sizeof(src) and wondering why only 4 or 8 bytes got copied.

Now if I steer them to hot new memccpy, they may hear "memcpy" and run off: "memcpy? Sure, I know that!"


I expected to read about SIMD and copying 8 bytes at a time in 64 bit registers, but I guess that is a compiler optimization



