You're absolutely right, and too few realise this.
In particular, people doing IO with small buffers drive me crazy. People unfortunately don't seem to realise how expensive the system calls and context switches are, and how brutal they are on throughput.
I've seen this in so many places. MySQL's client library used to be full of 4-byte reads (reading a length field, then doing a usually-larger-but-still-small read of the following data). I believe it's fixed, but I don't know when. I also remember with horror how t1lib - a reference library Adobe released for reading Type 1 fonts ages ago (the '90s) - spent 90%+ of its time on the combination of malloc(4) and read(..., 4), for tables of a size known at the outset (basically a small table with one entry per glyph that stored a pointer to a 4-byte struct instead of storing it inline).
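The shape of the problem, roughly (a hypothetical sketch of the length-prefixed-read pattern, not MySQL's or t1lib's actual code), is two system calls per message, one of them always 4 bytes, no matter how small the message is:

    /* Anti-pattern sketch: a tiny read() for the length prefix, then
       another (usually still small) read() for the payload. Every
       message costs two syscalls. */
    #include <stdint.h>
    #include <unistd.h>
    #include <arpa/inet.h>

    static ssize_t read_full(int fd, void *buf, size_t len)
    {
        size_t done = 0;
        while (done < len) {
            ssize_t n = read(fd, (char *)buf + done, len - done);
            if (n <= 0)
                return n;                      /* EOF or error */
            done += (size_t)n;
        }
        return (ssize_t)done;
    }

    ssize_t read_message(int fd, void *payload, size_t max)
    {
        uint32_t len_be;
        if (read_full(fd, &len_be, 4) <= 0)    /* syscall #1: 4 bytes */
            return -1;
        uint32_t len = ntohl(len_be);
        if (len > max)
            return -1;
        return read_full(fd, payload, len);    /* syscall #2: usually still small */
    }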
I'm hacking on SDL_vnc now and again, and it's full of 1-16 byte reads. That seems to make sense at first glance - after all, the size of a VNC protocol packet depends on the values of various fields - but on high-throughput networks or local connections the small reads/writes come to totally dominate overall throughput, even when the protocol overhead is a tiny percentage of the bitmap data being pushed.
Basically, pretty much anywhere you'd otherwise read less than 4K-16K at a time (possibly more these days), it's better to do the buffering in your app and use non-blocking reads, so you can read blocks as large as possible each time...
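Something like this, as a minimal sketch (hypothetical names, and glossing over the EAGAIN handling you'd want with O_NONBLOCK in a real event loop): pull data from the kernel in big chunks, then hand out the tiny protocol fields from your own buffer.

    /* App-side buffering sketch: one large read() per refill, so the
       number of syscalls scales with bytes transferred, not with the
       number of small fields parsed. */
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define BUFCAP (64 * 1024)

    struct inbuf {
        int fd;
        uint8_t data[BUFCAP];
        size_t start, end;          /* unread bytes live in [start, end) */
    };

    static void inbuf_init(struct inbuf *b, int fd)
    {
        b->fd = fd;
        b->start = b->end = 0;
    }

    /* Refill with one big read() only when the buffer runs dry. */
    static int refill(struct inbuf *b)
    {
        if (b->start == b->end) {
            ssize_t n = read(b->fd, b->data, BUFCAP);
            if (n <= 0)
                return -1;          /* EOF, error, or EAGAIN if non-blocking */
            b->start = 0;
            b->end = (size_t)n;
        }
        return 0;
    }

    /* Copy out len bytes; no syscall unless the buffer is empty. */
    static int buf_read(struct inbuf *b, void *out, size_t len)
    {
        size_t done = 0;
        while (done < len) {
            if (refill(b) < 0)
                return -1;
            size_t avail = b->end - b->start;
            size_t take = avail < len - done ? avail : len - done;
            memcpy((uint8_t *)out + done, b->data + b->start, take);
            b->start += take;
            done += take;
        }
        return 0;
    }

With that, the 1-16 byte field reads turn into memcpy's out of the buffer, and the read() count drops to roughly one per 64K transferred.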
But the general problem is not paying attention to the number of system calls. Not paying attention to stat()/fstat()/lstat() etc. is another common one (common culprit: Apache - with the typical default options for a directory, Apache is forced to stat its way up the directory tree on every request; it's easy to fix, but most people don't seem to be aware of how much it affects performance).
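For Apache specifically, the usual fix is along the lines of the snippet below (standard directives from Apache's own performance-tuning documentation; adjust the paths to your setup):

    # Stop Apache probing for .htaccess in every directory on the path,
    # and let it follow symlinks without lstat()ing each path component.
    <Directory "/">
        Options FollowSymLinks
        AllowOverride None
    </Directory>

AllowOverride None removes the per-directory .htaccess lookups, and plain FollowSymLinks (rather than SymLinksIfOwnerMatch) means Apache doesn't have to lstat() every component of the path.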