
My favourite C string function is snprintf:

• It takes a buffer size and truncates the output to the buffer size if it's too large.

• The buffer size includes the null terminator, so the simplest pattern of snprintf(buf, sizeof(buf), …) is correct.

• It always null-terminates the output for you, even if truncated.

• By passing NULL as the buffer argument (with a size of 0), it will tell you the buffer size you need if you want to dynamically allocate.

And of course, it can safely copy strings:

  snprintf(dst_buf, sizeof(dst_buf), "%s", src_str);
Including non-null-terminated ones:

  snprintf(dst_buf, sizeof(dst_buf), "%.*s", (int)src_str_len, src_str_data);
And it's standard and portable, unlike e.g. strlcpy. It's one of the best C99 additions.
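The NULL-buffer sizing trick can be sketched like this (format_dup is a made-up helper name, and the format string is just an example):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: ask snprintf for the required length with a NULL buffer,
   then allocate and format for real. Not a standard function. */
char *format_dup(const char *name, int value) {
    int needed = snprintf(NULL, 0, "%s=%d", name, value);
    if (needed < 0)
        return NULL;                            /* encoding error */
    char *buf = malloc((size_t)needed + 1);     /* +1 for the null terminator */
    if (buf != NULL)
        snprintf(buf, (size_t)needed + 1, "%s=%d", name, value);
    return buf;
}
```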


My favourite C string function is asprintf. It's trivially safe, since it allocates an appropriately sized buffer, and it promotes correct programming by allowing developers to rely on "if I have a string then of course it was allocated by malloc and needs to be freed appropriately".

It's not part of the C standard, but it's trivial to ship around an implementation with your code.
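A minimal fallback along those lines can be built on C99 vsnprintf; this is a sketch, with the name my_asprintf made up to avoid clashing with a platform-provided asprintf:

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of a portable asprintf-style fallback: measure with a NULL
   buffer, allocate, then format. Sets *strp to NULL on failure. */
int my_asprintf(char **strp, const char *fmt, ...) {
    va_list ap, ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);
    int len = vsnprintf(NULL, 0, fmt, ap);       /* first pass: measure */
    va_end(ap);
    if (len < 0 || (*strp = malloc((size_t)len + 1)) == NULL) {
        va_end(ap2);
        *strp = NULL;   /* leave strp in a defined state on failure */
        return -1;
    }
    vsnprintf(*strp, (size_t)len + 1, fmt, ap2); /* second pass: format */
    va_end(ap2);
    return len;
}
```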


Trivially safe? It can fail. As with snprintf, it's easy to forget to check the result, but at least with snprintf the result is always a safe string. With asprintf, not checking the result or doing it incorrectly will lead to undefined behaviour. From the man page:

> If memory allocation wasn't possible, or some other error occurs, these functions will return -1, and the contents of strp are undefined.


This is a dumb bug in the implementation. It should set *strp to NULL on failure.


To be fair, dereferencing a null pointer is also UB in general rather than a guaranteed abort. (Some platforms may provide stronger guarantees, of course; many prevent pages from being mapped at 0x0 as a security mitigation.)


It's worse than this, as even on platforms where actually reading from 0x0 is a guaranteed abort, dereferencing a NULL pointer in C is still UB, meaning the compiler can assume it won't happen and optimize the program accordingly.

To take a rather convoluted example: if you dereference the pointer and then call a function that does a NULL check before writing to the pointer at some offset, it's possible that the compiler will inline the function, then elide the NULL check (since you've dereferenced the pointer, the compiler assumes it's not NULL), then remove your dereference if it didn't have side effects, so now the write goes through without any check. Granted, it would have to be a write at a massive offset to actually hit an allocated page, but I'm sure there are similar scenarios that are more realistic.


Well, yes, you need to check return codes. I file that under "trivial"; it's true of almost every function.


Although I agree with your general idea that asprintf() is the smarter choice, I think it's a stretch to call it trivially safe.

I think most people would call its API safe but not trivially safe, where trivial means "when I see that function called in other people's code, I don't need to pay attention to that call because it can't do anything crazy like cause undefined behaviour."

After all, if you include "as long as you check" in your definition of trivial, it is also trivial to check the parameters to strcpy() if you are using it right. And yet here we are, in a discussion about how that's a risk because it isn't used right.

If asprintf() terminated the program when it failed instead of leading on to undefined behaviour when unchecked, I'd call that trivially safe in a more pragmatic way. If you're going to ship a function for portability anyway, that's what I'd recommend. And in fact, that's what is used in software like GCC, called xasprintf() there. It returns the pointer or exits the program on allocation failure.
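A sketch of that die-on-failure style might look like this (this is illustrative, not GCC's actual implementation; the name xasprintf_sketch is made up):

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of an xasprintf-style wrapper: returns the formatted string,
   or exits on failure, so callers never see an unchecked error path. */
char *xasprintf_sketch(const char *fmt, ...) {
    va_list ap, ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);
    int len = vsnprintf(NULL, 0, fmt, ap);       /* measure */
    va_end(ap);
    char *buf = (len >= 0) ? malloc((size_t)len + 1) : NULL;
    if (buf == NULL) {
        va_end(ap2);
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);                      /* never returns NULL */
    }
    vsnprintf(buf, (size_t)len + 1, fmt, ap2);   /* format */
    va_end(ap2);
    return buf;
}
```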


If asprintf() terminated the program when it failed instead of leading on to undefined behaviour when unchecked, I'd call that trivially safe in a more pragmatic way.

Ugh. Maybe in some contexts that's ok, but some of us write code which handles memory allocation failures with a bit more finesse than abort().


In most contexts aborting on malloc failure is OK, and preferable to trying to handle it gracefully, which has caused lots of problems, including security problems. On Linux you need to be running in a non-default configuration for malloc to be fallible in the first place (other than in trivial circumstances like attempting to request exabytes of memory in a single call).


Yeah, that's one of the things I don't like about Linux.


And some of us includes me.

But at the point where you're able to handle all memory allocation failures usefully, you're almost certainly doing something non-trivial to recover.

For example aborting some requested transaction, pruning items from your program's caches, delaying a subtask to release its temporary memory, or compressing some structures. At that point there's nothing "trivially" safe about a memory allocation.

Probably there are bugs in those recovery actions too. Even the obviously most simple recovery action of propagating an error return to abort a request: If that's a network request, just returning the error "sorry out of memory" is potentially going to fail. So you need recovery from the recovery path failing.


It's also the slowest of all the string functions. Additionally, providing NULL as the buffer argument does all the work of formatting the actual string you're going to copy (in the case of using multiple conversion specifiers) and then discards it. So you have to do the format operation twice, doubling the cost of any string operation.

Oh and it's not compatible with UTF-8.


What makes it incompatible with UTF-8?


Normally, UTF-8 works with C strings no problem, but it's possible that a multi-byte UTF-8 encoded codepoint is only partially copied when the output reaches the buffer's capacity, and snprintf writes the null terminator in the middle of it.
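A small sketch of that failure mode: "né" is 3 bytes (0x6E 0xC3 0xA9), so a 3-byte buffer keeps only the first byte of the two-byte codepoint:

```c
#include <stdio.h>

/* Sketch: snprintf truncates byte-wise, so a multi-byte UTF-8 sequence
   can be cut in half. "n\xC3\xA9" is "né": 'n' plus a 2-byte codepoint. */
int truncates_mid_codepoint(void) {
    char buf[3];                                  /* room for 2 bytes + NUL */
    snprintf(buf, sizeof buf, "%s", "n\xC3\xA9");
    /* buf is now { 'n', 0xC3, '\0' }: a valid C string, invalid UTF-8 */
    return (unsigned char)buf[1] == 0xC3 && buf[2] == '\0';
}
```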


Why is snprintf slow? I am surprised that it would be slow especially when compared to methods like asprintf that allocate the buffer.


I forget the exact reasoning now, but I remember it being about 10x slower than memcpy or strncpy. I think the main reason was because of the need to parse the format string.


Even then, printf and scanf are typically faster (and not even by a little bit, by a lot) than C++ iostreams formatted output, even though iostreams gets all the formatting information at compile-time, while printf has to parse the format string.

On the other hand, if people start to use snprintf in that particular form as a safe way of string copying, compilers could pattern-match this and substitute a direct implementation.


The modern C++ way of formatting strings is with std::format, or the external fmt library. It's faster than printf (and certainly streams!) while having convenient format strings (unlike streams) and optional compile-time format parsing, combined with C++ type safety.


Everyone complains about streams, yet since Turbo C++ 1.0 for MS-DOS they have served me well for the kinds of applications I delivered into production with C++.


Isn't the biggest iostream bottleneck the forced sync with cstdio? You can turn that off at program start.


With that enabled it's comically slow, but even without the stdio syncing (and even operating on stringstreams) it's still much slower. For simple one-off formatting it's hard to notice, but once you write a parser for some engineering file formats (most of which are still ASCII) and have to read a few million floats here or there it becomes a really stark difference. For parsing floats specifically there are some implementations around which are optimized more, though some of the "fast and clever" ones are less accurate.


Oh I see, I thought you were comparing it to the printf style methods but compared to methods that do not take a format string that makes sense.


snprintf works most of the time but it can fail, and people almost never check the return value. For example it will always fail if you attempt to print a string >= 2GB. If that happens the output buffer may remain uninitialized (depends on the implementation) and you're at risk for a Heartbleed-like scenario.


I'm curious, what type of application would require printing strings over 2GB in size?


(Intentionally crafted) large metadata in a video file (which can be very large anyway, so 2GB extra doesn't stand out). A string field in a (machine-generated) JSON/XML dataset. And like tomohawk says, lack of a mechanism that prevents accruing large blobs (for example not limiting the size of an incoming HTTP header in a web server).


The one that plans to print much shorter strings, but accepts arbitrary strings.


Binary file contents are often strings, right? If you're using snprintf() to copy strings, as the thread was suggesting, there are valid reasons to try to copy such strings (though they would be few and far between).


A bug elsewhere could lead to this.


Note that snprintf returns the number of bytes that would have been written if the dest buffer were large enough, not the number of bytes actually written. I've seen a few projects misunderstand that and write code like this:

    for (i = 0; i < ...; i++) {
        offset += snprintf(
            dest + offset,
            sizeof(dest) - offset,
            "%s...",
            str[i]);
    }
That will cause a buffer overflow: if iteration n gets truncated because the dest buffer fills up, iteration n+1 will write past the end of the buffer.
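A safer variant clamps the return value (the would-be length) so the offset never runs past the end of the buffer; join_words and its format string are illustrative:

```c
#include <stdio.h>

/* Sketch: accumulate snprintf output safely. The return value is the
   length that *would* have been written, so clamp it on truncation. */
size_t join_words(char *dest, size_t cap, const char **words, size_t n) {
    size_t offset = 0;
    for (size_t i = 0; i < n && offset < cap; i++) {
        int r = snprintf(dest + offset, cap - offset, "%s ", words[i]);
        if (r < 0)
            break;                       /* encoding error */
        if ((size_t)r >= cap - offset) { /* truncated: stop at buffer end */
            offset = cap - 1;
            break;
        }
        offset += (size_t)r;
    }
    return offset;                       /* bytes written, excluding NUL */
}
```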


snprintf is underlying most logging modules I've done (logging to memory / file / network / console...) - I've been thinking about doing custom formatting routines but there's surprisingly little need for them.

You probably know this, but sizeof is not a function. I prefer the easier to type

    snprintf(buf, sizeof buf, ...);


For anyone curious, this is something that's been discussed on HN at length as a result of an lkml thread:

https://lkml.org/lkml/2012/7/11/103

https://news.ycombinator.com/item?id=9629461


Linus disproves his own point. First he claims that sizeof behaves like a function, then in the next breath, realizing the flaw in his logic, proceeds to describe and excuse the counterpoint: sizeof(*p)->member.

This is classic Linus--too emotionally invested in a preference. Except in this case it's particularly pointless and unjustified.

sizeof is an operator. Period. The point of not using parentheses is to continually drive that point home. It's praxis. Of course, it's not unreasonable to prefer using parentheses. And there's a middle ground: most C styles nestle function identifiers and opening parentheses in a function invocation, whereas they require a space between operators and binary operands. So if you prefer using parentheses, whether all the time or just in a particular circumstance, you can do:

  sizeof (*p)->member
or

  sizeof ((*p)->member)
That's not entirely consistent. sizeof is a unary operator, and style guides tend to prefer nestling unary operators while spacing binary operators. But nobody is trying to be pedantic here. The issue is readability, minimizing typos, and dealing with the fact that the sizeof operator, while defined as and behaving exactly like a unary operator, doesn't look like one.

Also, it's worth pointing out that not only does the C standard itself literally define sizeof as a unary operator, all the code examples in the standard put a space between sizeof and its operand. It's a stylistic convention, but hardly arbitrary.

By contrast, there are other constructs, like _Generic, where the code examples do NOT use spacing. _Generic is a specialized construct altogether, but syntactically it behaves somewhat like a macro, and it's customary to style macro invocations like function invocations.


That was a great comment, thanks a lot for the clarification!

Another point that's missing from Linus' post is that sizeof is special also in that its argument is never evaluated: sizeof launch_the_missiles() will never launch the missiles.

Yet another specialty is that an array doesn't decay when used with sizeof: if you have char buf[1024];, then sizeof buf is 1024, not a pointer size like 4 or 8.

To me, it's not like a function at all. It's more like an assembler macro. I usually agree with Linus and learn a lot from his posts, but here I think he's unsuccessfully trying to find arguments for his (IMO, misleading) stylistic preferences.
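The no-decay behavior can be sketched like this (function names made up):

```c
#include <stddef.h>

/* Sketch: sizeof sees the full array type; once the array has been
   passed to a function it has decayed to a pointer, and the element
   count is gone. */
size_t whole_array_size(void) {
    char buf[1024];
    return sizeof buf;   /* 1024: size of the array itself */
}

size_t decayed_size(const char *p) {
    return sizeof p;     /* size of a pointer (typically 4 or 8) */
}
```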


My reading of his argument is that sizeof is a (compile-time) function that maps expressions or types to their storage size.

From that lens, C has a design mistake. I'm inclined to agree.


The + and - operators do exactly that with pointer operands, but nobody is claiming that they're behaving like functions.

In C++ most operators could end up invoking an actual, user-defined function at runtime. But, again, nobody is claiming that they therefore behave like functions.

I get the logic--if you squint really hard you can analogize sizeof to a function. It just doesn't work, as evidenced by his own example. C isn't LISP. It's C. It has a unique grammar with distinct classes of constructs. sizeof is a unary operator, parses exactly like every other unary operator (tokenization conflict with identifiers notwithstanding), which is quite different from how function calls are parsed, particularly wrt parentheses. sizeof has some unique characteristics, but every operator has unique characteristics; that's why they're operators, as opposed to some functional languages that try to subsume everything into function-like syntax.



