I think the most important lesson of this article is in this sentence:
Why this code was written this way instead of just checking against the ID by linearly iterating over the Components, I have no idea.
There's another comment here that shows their JPEG decoder only supports 1 or 3 components. Using a hash table for at most 3 entries is astronomical overkill! The field for the number of components in a JPEG file is only a byte - so even if you were to support more than 3 components, it would be 255 at most, and sane JPEG files are unlikely to have more than 4 in practice. So fixed-size arrays are the perfect solution. I have a very good idea why it was written like that, however: the mere availability and ease of use of abstract data structures naturally leads to their overuse by what seems to be an overwhelming majority of developers, and the eagerness to "futureproof" and add unnecessary complexity is also very common.
Overcomplexity creates bugs, and (leaky) abstractions hide them. Even if the complexity (and bug) isn't visible in your code, it's still there. If the initial code had been reviewed by me, it most certainly would not pass.
I am saying this as someone who has actually written a JPEG decoder --- it is not as difficult as it may sound, and I highly recommend giving it a try --- and also witnessed (and fixed) the messes that people get in with overcomplicating things. Perhaps that slight "architecture astronaut" mentality comes from starting with a higher-level language; I started with Asm and my preferred language is usually C, and I haven't seen the tendency to overcomplicate much in others who did the same.
The reason the Components were in a hash table was that they would then be checked against the component ordering in a “Start of Scan” section to make sure all the components in the SOS section are in the expected order.
For comparison, this is the corresponding code in my JPEG decoder. I didn't even bother using a loop:
if (read_byte() != cids[0])
    data_error("SOS scan first component must match");
hts[0] = read_byte(); /* high nybble: DC table index; low nybble: AC table index */
if (read_byte() != cids[1])
    data_error("SOS scan second component must match");
hts[1] = read_byte();
if (read_byte() != cids[2])
    data_error("SOS scan third component must match");
hts[2] = read_byte();
>I have a very good idea why it was written like that, however: the mere availability and ease of use of abstract data structures naturally leads to their overuse by what seems to be an overwhelming majority of developers, and the eagerness to "futureproof" and add unnecessary complexity is also very common.
To give some context, iterating over an array of three small items takes roughly 100 ns in the worst case (a cache miss) and about 1 ns in the best case (an L1 hit). Most common hash map implementations (like std::unordered_map) allocate each item dynamically, so the items are scattered randomly in memory, giving roughly 300 ns worst case (3 cache misses) and 3 ns best case (3 L1 hits) for iterating over them. So the hash map is going to be about 3x slower without even accounting for the time spent hashing, and 3x more likely to hit the worst case, since best performance requires three cache lines to still be in L1 rather than just one. And if you're regularly adding and deleting items, a std::unordered_map would literally be orders of magnitude slower, because it calls malloc/free for every item added or removed.
Or better yet, use a language with a sensible SSA-based compiler that can identify local-only variables and allocate them on the stack (like Go's escape analysis does).
That technique is useful _and_ independent of what I suggested. I.e. you can do it whether you have garbage collection or not.
Though it's less important on languages with compacting garbage collectors, because there allocation can be just a pointer bump, which is pretty similar to what you do for stack allocation.
See also this Scheme implementation which allocates everything on the C stack:
> Previous Schemes for implementing full tail-recursion when compiling into C have required some form of "trampoline" to pop the stack. We propose solving the tail-recursion problem in the same manner as Standard ML of New Jersey, by allocating all frames in the (garbage-collected) heap. The Scheme program is translated into continuation-passing style, so the target C functions never return. The C stack pointer then becomes the allocation pointer for a Cheney-style copying garbage collection scheme. Our Scheme can use C function calls, C arguments, C variable-arity functions, and separate compilation without requiring complex block-compilation of entire programs. [Introduction] IEEE Scheme [IEEE90] requires that all functions be properly tail-recursive, in order that tail-recursive programs not require an unbounded amount of stack space. Several Scheme compilers have targeted the C language [Bartlett89], because C is an efficient systems programming language which is available on nearly ev...
Do keep in mind that some languages like Scheme (because of call/cc) and Haskell (because of laziness) in general don't have a single stack. See e.g. https://en.wikipedia.org/wiki/Parent_pointer_tree for what Scheme needs to do.
Of course, allocation of variables with hierarchically nested lifetimes (i.e. stack allocation) is still an important and common special case that your compiler/interpreter should probably recognize and handle well.
No. unordered_map buckets are linked lists of nodes (key + value pairs). So you have one index into the bucket array to retrieve the list head pointer, and then you may or may not have to chase an arbitrarily long linked list. But the nodes are necessarily allocated separately, so (assuming the bucket array is in cache [the previous post should say 4 cache lines, not 3] and you don't have to chase beyond the first list element) you still have that indirection somewhere into memory.
This all comes from the specification requiring support for load factors arbitrarily close to 1 while still having amortized constant-time insertion, lookup, and deletion.
In this case, there is an alternative view to 'architecture astronaut': Conservative programming. If you don't have experience with the spec, it might not be clear that n<=4 is the reasonable maximum if the spec permits n=255. Who knows what horrible JPGs exist in the wild? Even if you only find examples to test n<=4, are you willing to take the risk ignoring what's left? Your code needs to be out in the real world for a while before you can truly say. Better a slow but working implementation.
GP explicitly covered this: Just expand your fixed-size array to the required size, it's still not a big array. It solves the exact problem, the map is overkill and overly complicated. The problem is, the map is only more complicated in the resulting binary, not in the library users' source code.
Allocating 255 elements is not the end of the world. Unless the decoder is supposed to run on a memory-constrained platform, this is barely noticeable.
The code switches out a static array of size 3 for a hash table, in order to handle the fact that incoming IDs (used as indexes into the array, then as keys in the hash) can go up to 255 instead of just 2.
I would think that upping the array to size 256 would still be cheaper (or at least not much more expensive) in terms of memory used than adding a full hash table, but I have not inspected Serenity's hash implementation to validate that claim. It would almost certainly be cheaper in terms of performance, especially due to fewer memory allocations.
Or even use an array of size 3, but include the IDs in each element and linearly scan for the right ID. As the whole array will likely fit on a cache line and you want to access all three elements in a short time anyway, this is likely no slower than a hash table.
That's what I was thinking after reading the rationale, there are plenty of options that don't involve a hash map.
That said, linearly scanning is not much different from a hash map with linear probing that has no empty slots left, so it's not necessarily obvious that the hash map actually takes that much more space.
I wouldn't expect a hash map that uses linear probing to wait until it's full (or even nearly full) before increasing the size of the table. Doing a linear scan on a hash table which is both nearly full and very large would be quite bad.
Agree. Even if the hash map somehow wouldn't use more memory than an array[256], the array would still be faster, as the compiler can optimize more due to the simpler memory layout, and even the CPU can cache more efficiently and speculate better since less/simpler code is involved.
Always start with an array if the problem allows for it, even if you need to reallocate sometimes, or have to scan linearly. Measure performance before using a map or linked list.
Skimming the JPEG spec, I think the commit author was waaaay off base. The actual spec is very clear that 4 components is the max and that they are expected to be sequential in nature (4.11, Summary of coding processes), unless I'm reading it wrong. Ultimately the original implementation was wrong for not including a fourth channel per spec... but not wrong in assuming sequential ordering.
Even if there are more than 4 components, they are still required to be sequentially ordered. Their decoder also errors out if there are more than 3 components (https://news.ycombinator.com/item?id=27375596), so all you need to do is keep the IDs in an array of 3 entries and check that they match later. That's what the fragment of code I posted from my decoder does.
That's what I'd assumed when I started skimming the spec, but the section I referenced seems to indicate a max of four regardless. I've never heard of a JPEG that's not RGB(A) or CMYK. So it would be highly unusual, not saying it couldn't happen but given how unusual it would probably be replaced with a custom format for that specific application.
It might be higher performance, but it wouldn't necessarily be less complex or less likely to result in a bug, as the original point in this thread suggested. It turns out it's not exactly an "overcomplexity" problem. And it looks less like it was caused by "the mere availability and ease of use of abstract data structures naturally leads to their overuse."
Even if not optimal, it definitely looks less unreasonable as a choice now that the original motivation has been found.
And if it had been “fixed” in the simple way without finding the original motivation, it very well may have introduced a different edge-case bug.
So one important lesson here is to always do some "due diligence" to track down why code was written the way it was before changing it, even if (or especially if) you assume it was just shoddy coding. (What's the metaphor/aphorism used about this? I remember there is one, but can't recall/find it now.)
(And also reminds us of the importance of good granular commits with good commit messages, to leave that history! Much easier to find original motivation than in a pre-version-controlled world).
> Using a hash table for at most 3 entries is astronomical overkill! The field for the number of components in a JPEG file is only a byte - so even if you were to support more than 3 components, it would be 255 at most, and sane JPEG files are unlikely to have more than 4 in practice. So fixed-size arrays are the perfect solution. I have a very good idea why it was written like that, however: the mere availability and ease of use of abstract data structures naturally leads to their overuse by what seems to be an overwhelming majority of developers, and the eagerness to "futureproof" and add unnecessary complexity is also very common.
> Overcomplexity creates bugs, and (leaky) abstractions hide them. Even if the complexity (and bug) isn't visible in your code, it's still there. If the initial code had been reviewed by me, it most certainly would not pass.
I suspect in this case the problem is trying to use this style of coding in a language like C++.
If you implement this in eg Python, using the hash table would have been perfectly natural and the simplest way to do it. (And a modern enough Python would guarantee your ordering.)
In Java, Haskell or Rust etc something else might have been simplest.
> Using a hash table for at most 3 entries is astronomical overkill!
I disagree. Simplicity of the code is most vital to avoid bugs, and if a hash table is what the language typically uses to represent key value mappings, that is what should be used, for 3 or 3 million items.
Keep the code idiomatic, and only optimize where necessary.
A good implementation of a hash table has a special fast path for tiny runtime sizes.
A good compiler could bound the size of the hash table, and switch out the implementation if its size can be bounded at compile time.
Sure, in this case, it didn't work so well. But the principle of idiomatic code still stands.
From experience, I can say that the number of times I've heard those arguments corresponds closely with the sort of code this article is about.
If you're using a language where all you have are hash tables (PHP comes to mind), then there's no way around it; but it's insanity to overlook simplicity.
> I have a very good idea why it was written like that, however: the mere availability and ease of use of abstract data structures naturally leads to their overuse by what seems to be an overwhelming majority of developers, and the eagerness to "futureproof" and add unnecessary complexity is also very common.
Or perhaps the code is shared with some other project which does require the added complexity.
Yes, hash tables need to iterate in either a guaranteed order (usually insertion order) or a guaranteed-to-be-random order. Either choice makes sense, but one of them needs to be picked, or people will create bugs like this.
FWIW most modern hash tables are randomised on a per-run basis as a side effect of the hash being seeded with a random value for resistance against attacker-controlled collisions.
One year ago I hardened LuaJIT's VM against these kinds of attacks. Since then, there has been a constant influx of complaints and issues filed, all bitterly complaining that their code, which mistakenly assumed a fixed hash table iteration order, is now broken.
Even when told that the Lua manual has clearly stated for 20 years that the order is undefined, they do not cease to complain. They do not realize this change helped them discover a serious bug in their code (the order could differ even before that change). Sigh.
You can now have a guess, what one of the lesser enlightened forks of LuaJIT did ...
I’m not surprised. The same issue occurred in python.
And to be fair, it's a pain in the ass to debug and find out why something happens to implicitly depend on iteration order (floating-point summation order is a common culprit, but not the only one). And their code did work beforehand, for most values of "work".
The biggest pain in the ass is that, at least in Python, while you can set the hash seed explicitly, if you don't, the language doesn't tell you which one it used. This makes reproducing the issue very annoying when only some seeds trigger it.
> the order could differ even before that change
While the order could differ I assume it was deterministic and nothing influencing those bits had changed in a while.
And now, as predicted by core developer Raymond Hettinger in his Modern Dictionaries talk, Python's dicts are now guaranteed to be ordered by default (as of 3.7).
To the extent this creates a performance penalty, it's a little annoying that a few systems create a behavior dependency that is not truly needed in a core type when there is a separate type that implements the desired semantics. But then again if it doesn't make it slower it should be fine.
Fwiw, the way Raymond re-implemented dictionaries made them more efficient (the algorithms are the same, but it's now much more cache friendly), and had the side effect of making them ordered. He and many others advocated for having the ordering guaranteed going forward.
And now they're locked into an implementation that is order-preserving, even if an optimization tomorrow could eke out another 10%, if ordering was not guaranteed.
That hash values get distributed differently inside the table to avoid collision attacks is totally unrelated to what iteration order you get when you iterate over the elements. You can easily have a predictable iteration order while keeping hash collisions unpredictable.
Look at Python, Ruby, PHP: they have defined iteration order to be insertion order. JavaScript also has a predictable order, though the order is a bit more complex. Go is the outlier in that it defines the order as guaranteed to be random. But both choices are just fine.
> totally unrelated to what iteration order you get when you iterate elements in the hash
That is a bit too strong an assertion: you can decorrelate the two when you build the collection, but in the naive setup, hash randomisation very much does randomise iteration order.
With a sufficient number of users of an API,
it does not matter what you promise in the contract:
all observable behaviors of your system
will be depended on by somebody.
Mark my words: Hyrum's Law will strike in full force and introduce a whole new world of pain once more and more software development is done (or assisted) by machine learning.
What happened here is a perfect example of what an optimiser might come up with, and the resulting systems will be so brittle and delicate that a whole new form of cargo cult might develop around them :D
I wonder if there actually is a possibility of something like that in human genome. Are there any known bugs that can be introduced to human genome and instantly kill a human after, e.g. 18 years of normal operation?
"Known", probably not, because nobody cares, but if you can edit the genetic code of all cells (not just one) you just need to do a function-affecting substitution of a protein-coding segment for a vital protein.
Doesn’t matter which one as long as it’s not solely developmental.
It would not quite be instant though, the body would first have to run out of the existing supply then break down as operations stop. If you pick the right protein (e.g. something affecting the muscles or brain) I’d think the timescale would be minutes if not seconds.
If you can only edit one cell or a small number if them then probably not, you can definitely cause long-term issues (e.g. cancers) but I would not think instant death an option.
I think the GP was talking more about existing genetic defects leading to sudden issues down the line though.
Surely that's already the case? There are tons of latent bugs which are only revealed when the conditions to trigger them are finally encountered. Every 4 years, code breaks all over on leap days. A program crashing after not crashing is essentially "the system self-destructing".
There's even a CPPCon talk about Google's "Swiss" tables with interruptions by Hyrum each time the new code's observable behaviour changes because, at Google, they internally have enough users for Hyrum's law to kick in all the time:
The talk was presumably better live because Hyrum doesn't have a microphone, so his interjections are hard to follow, but it's a good way to illustrate.
Some of the examples are much like this story, you design this sophisticated, heavily optimised unordered container for millions of items, and then your users put three items into it, and they depend on the order of those items and you "broke" their code by not having this behaviour you never promised and it made no sense to depend on.
And the defence mentioned in this thread (randomise things which aren't guaranteed to have stability) is introduced because these are Google employees, they know about Hyrum, it just isn't enough. Sure the code failed when I tried to store 40 items due to your randomisation, but I found it worked OK with ten items as then the randomisation was defeated, so now I store ten items, and then your latest change broke that...
The pathological use of data structures is why stuff like making the constructor for empty strings very cheap has a big effect on some codebases. Rust const-ified String::new() because an empty string doesn't need any heap storage. So code which loops over a large array setting all the members to the empty string is only about as expensive as looping over an integer array setting all the integers to zero; it doesn't incur a heap allocation or a function call. Of course, modifying these strings later takes on the expense of allocating, but guess how rarely people who do stuff like this ever actually touch the string...
>they depend on the order of those items and you "broke" their code by not having this behaviour you never promised and it made no sense to depend on
reminds me of a time my extension started glitching because localStorage stopped returning results in same order they were put in :o) Now that I think about it it happened somewhere around 2017-2018, did google drop it into Chrome at this point?
Inserting randomness can help prevent denial of service attacks as well, by making it harder to manufacture pathological hash collisions (which can turn ~O(1) operations into O(N) operations).
Another place to insert randomness that could have uncovered this would be in the allocator size classes. In non-optimized builds, if your allocator has size classes, just pick them randomly. In my experience this uncovered a lot of bugs in programs linked against tcmalloc.
Right, like every release of any component should vary some detail that the spec doesn't guarantee. Especially, all hash functions should change on every release -- at least when building in debug mode.
Hash functions should use a random seed on every new container instance (or at least once per launch). Those hash tables that need specific properties of the hash should inject it in that way (cryptographic, deterministic, DOS-resistant, etc).
Interestingly, Python deliberately changed it in the other direction. Since Python 3.7, the iteration order is guaranteed to be the same as the insertion order.
This was previously a special object (collections.OrderedDict) but is now the default behavior of every dict:
Python probably still does the same under the hood as the randomized hash was required to harden web frameworks against a denial of service attack that used long lists of keys with colliding hashes to force a worst case runtime.
Yes, that is because sets have their own hash map implementation, and the dev team remains so far unconvinced of the value of "unique lists"; the primary gains of the new scheme might be nonexistent or of low impact for the set case (for Python, the ordering properties were an ancillary benefit of the dict change, not the reason for it).
I was just bitten by this the other day. In my case, it's not that I cared about any specific order ("unique lists"), but I did need the output to be deterministic for a caching mechanism to work properly. I knew dicts would 'just work' because of the ordering guarantee, but I was surprised to learn sets didn't. In the end the easiest fix was to just reimplement a set type on top of dict.
A problem with this is that Go code that uses native maps is non-deterministic, and is very annoying to make deterministic-ish when reproducing results is important.
"is non-deterministic, when you use a plain range to iterate over such a map".
If you need ordered iteration to stay deterministic, extract the keys, sort them, then iterate over the (now-sorted) slice of keys?
If the order matters, you should defensively make SURE you are iterating in that order. If the order doesn't matter, the fact that each iteration over the map is different helps you not fall into a "I know this" trap.
> If that is required there are other data structures to pick from.
Not really, since it's Go ("lol no generics"): there's no easy way to swap the builtin hashmap for another associative array with deterministic behaviour, and you will lose something (definitely performance, likely either convenience or type-safety, possibly both).
No different than using C, pre-C++98, Pascal, BASIC, Java pre-1.5.
Yes, it sucks not having generics, but why should the built-in type contribute to the bad practice of relying on iteration behavior, thus crippling any future performance improvements to the data structure?
I think the general idea is "shift left", which you can say exposes bugs earlier, or prevents bugs from reaching production, which is kind of six of one, half a dozen of the other. Bugs which are exposed early enough do not get committed.
"Shift left" just means that you shift the feedback to the developer farther left in the build process, closer (in time and space) to when the mistake was made.
I don't know about go specifically, but many languages randomize their hash tables to avoid Hash Collision DoS attacks. This is a problem for any piece of software putting attacker-controlled data into a hash-table that does not randomize the hashing and where they assume that it will be amortized constant time.
No, it’s the builtin hashmap. It had a random seed for the hash to avoid collision predictability but it also, separately has a shuffled iteration order for maps.
> it's only as hard as using a (different) data structure that preserves order.
Which turns out to be pretty inconvenient in Go. So more likely you’d do something like copy the keys to a slice, sort the slice, then iterate that to get the map values.
Incidentally, that’s exactly what the json package does when encoding a map.
I doubt that’s why. Modern hash tables typically include a random factor in the hashing to prevent bucketing from being predictable, so attackers can’t craft malicious input that causes every value to be put in the same bucket as a form of DOS attack. The fact that this means iteration order is randomized is a side effect.
The insertion of a random factor in hashing/bucketing that you're describing would change the iteration order per _instance_ of a hash table, not per _iteration_ as the parent comment notes is the case for go maps.
Great read, and fantastic debugging! Correct me if I'm wrong, but I think this bug would have been caught if a CMYK JPEG would've been tested, assuming that an extra component would break the accidental well-orderedness of the stream.
I can absolutely confirm that this is what it's like to diagnose weird bugs on nascent operating systems, particularly on or nearby "portability boundaries", which are not always what you think they are. Pull that thread, but be sure you're ready for Jenga.
The code was wrong, but the produced binaries were not.
From the user perspective, a change to malloc really did break JPGLoader. That it broke this by revealing a pre-existing fault in the code is interesting to note. Especially for the developers. But it does not change the user experience.
Fuck, Lenna [1]? Could we stop alienating women in tech [2]? Don't take my word for it, the AV1 folks are pleading with the field to stop using Lenna too [3].
Huh, I had no idea that image was controversial. I think it's an easy mistake to make. Didn't know it was from a nude magazine originally, I just thought it was a traditionally used test image, like the Utah teapot in 3D rendering.
I wouldn't jump to the conclusion that the image was intentionally chosen to alienate women in tech, though.
Thanks for the TIL!
Nobody would know or ever care to know if these so-called justice warriors weren't always mentioning it and linking the whole naked picture every time there is a chance to do so. "Look at it! How _disgustingly_ immoral!"
It's a beautiful portrait, and a standard in image compression comparison. Deal with it.
I strongly dislike that prudish and puritan zealots now exploit the ideals of feminism and social justice to push forward the idea that we should be afraid of sex and our bodies. They're not "doing it for the women in tech", but because of their own issues with sexuality.
"If a professor makes a sexist joke, a female student might well find it so disturbing that she is unable to listen to the rest of the lecture. Suggestive pictures used in lectures on image processing are similarly distracting to the women listeners and convey the message that the lecturer caters to the males only. For example, it is amazing that the Lena pin-up image is still used as an example in courses and published as a test image in journals today." ~D. P. O'Leary, 1999
If you appease the complainers they will only find something new to complain about. They do not seek justice or a better world but only attention for themselves and complain simply for the joy of complaining.
I take offense at the proposed alternative pictures. There is lots of war-related gear shown (military propaganda) that potentially disturbs pacifists. Also there is a pretty lady shown, too. (I actually did not get why this is supposed to be sexist?)
Seriously, I doubt you can find any picture at all where nobody finds a reason to be offended.
I would not be so sure about it. I am pretty sure there are parents whose children have an (imagined or real) fructose malabsorption and therefore have to avoid fruits as unhealthy, or who will point out that the fruits in the picture come from too far away, propagating a climate-harmful lifestyle, etc. etc.
And I bet most people did not know that the picture in question is from Playboy; I did not. But still, so what?
It shows an attractive woman in a slightly seductive pose. Is that really sexist? Why?
I find it disheartening that so many people object to changing a single picture.
I think this account by Maddie Zug [0] summarizes most people's objections to the image. Women just want to participate in tech, but the male-dominated field already makes it hard enough, especially when images with a very sexual origin (again, it isn't the pretty woman, but the origin of the picture) are used. Is it that difficult to use another picture from the above database?
"I retired from modeling a long time ago. It’s time I retired from tech, too.” - Lena Söderberg
I personally think the Lenna image has transcended its origin and is to me no more sexist than, say, an antique nude statue. But if people are bothered by this, if they are weirded out or feel even a bit uncomfortable, then it is a no-brainer to just take a different picture!
Might I suggest using an image of used tampons and spatters of menstrual blood as a standard. Surely no one is disturbed by something so natural and none of the boy who like to play with their woodys under the desk while giggling over centrefold pictures will be discouraged from continuing with the class when forced to look at it.
Nobody's saying “don't use pictures of women” – or even “don't use pictures of models”. They're saying “don't use a cropped Playboy pin-up”. That's not “like Saudi Arabia”.
Answer me this. Why not? Has the model herself lamented and regretted the fact she posed for that picture? No. *
Quit being prudish and embrace the human body. There is no sin in appreciating beauty. And it's not like it's forced nudity, it's always the offended ones resurfacing the Playboy original. This article doesn't deserve any hate for using an innocent and historically-relevant picture.
* 'In January 2019 [Lena Forsén] said that while she wished she had been better compensated, she was "really proud of that picture"'.
While the test image itself doesn't show much, the fact that it originated from Playboy magazine is widely known – and presumably has contributed to the image's popularity. It's a sort of in-joke: you can't understand it just by seeing it, you need the context. But it's an in-joke that recalls sexist tropes.
To elaborate, the wikipedia article cites some organizations moving away from Lena, but there's only one source calling things disturbing and suggestive, and that source is the exact same quote that's being questioned. Linking to that is not a useful answer. It does nothing to explain why that particular quote is correct.
Sorry I'll reword. The question asked was "How is Lenna's picture disturbing or suggestive?"
So don't focus on the word "correct" as much as the "explain why". They asked "how" and they did not get an answer.
You linked to a page that doesn't help answer that question, because it only has the same quote.
Lots of things are opinions. But you can justify an opinion, explain an opinion.
If I was going to argue against using Lena, I'd say things about the context of the image and why that's a problem. Any problems with the 512x512 image itself are much less obvious, and I think it's fair to want a more detailed argument on that front.
I dismissed that quote because we're not arguing about whether Lena should be used or not. That was never the question in this entire chain of comments.
> The question is "How is Lenna's picture disturbing or suggestive?"
Right. Which is a subset of the possible reasons to not use it. Everything about playboy is outside that subset, and does not answer the question, despite being a reason not to use it.
Much like "It's a violation of copyright" would also be a reason not to use the image that doesn't answer the question.
We could use Michelangelo's David (in its full glory) and I bet not a single man would be disturbed (maybe quite the opposite)... problem is, he's mostly white.
It's a shame that a ~naked body is "disturbing" to US women.
BTW: Why so many human killing machines as alternatives? I feel offended now.
The links are dead, but the citations are valid. Here are working links:
J. Gutbezahl, How Negative Expectancies and Attitudes Undermine Females' Math Confidence and Performance: A Review of the Literature. https://eric.ed.gov/?id=ED380279
So I looked into these links. On [3] there is no data that backs this quote of hers:
"Suggestive pictures used in lectures on image processing are similarly distracting to the women listeners and convey the message that the lecturer caters to the males only."
The 30-year-old (!) Spertus paper even states that most of the data collected was anecdotal.
Ignoring the tasteless act of trying to use the slack at the end of an allocation for your side projects, the bug really highlights the dangers of illiterate interfaces, to wit Color::Color(u8 r, u8 g, u8 b) from Serenity LibGfx. Calling this incorrectly is incredibly easy to do and invisible to the reader at the call site. Authors need to work with the type system to avoid these bugs, in this case having distinct types for red, green, and blue so the compiler will stop you from assigning red to blue without an explicit conversion. "Simple confusion of RGB vs. BGR" is not something that should seem natural to programmers. It's an error caused by underuse of the type system, just treating everything as a bag of numbers.
> "Simple confusion of RGB vs. BGR" is not something that should seem natural to programmers.
These kinds of errors are extremely common in much more mature libraries, including OpenCV, as well as HLSL or GLSL. No disrespect, but it sounds like you've never really touched graphics programming before. Color spaces are literally just a bag of numbers.
A good analogy is coordinate systems. There's no type difference between an x-coordinate or a y-coordinate; they're just numbers and are defined by their ordinal position in an (x, y) tuple.
To add to this, libraries aren't consistent about convention either. Take the common "RGBA". Is this describing the byte order? Or describing most-significant to least-significant 0xRRGGBBAA, which might be little-endian encoded into an "ABGR" byte order? If you think that's a stupid idea and that I'm an idiot for suggesting it, what about 16-bit RGB formats where the G component straddles both bytes? Okay, so those aren't so common anymore - but what about DXT/BC compressed formats, which encode 4x4 pixel chunks at a time and don't cleanly decompose into "channels" in the first place?
D3DFMT_A8R8G8B8 (Direct3D 9) is equivalent to DXGI_FORMAT_B8G8R8A8_UNORM (DXGI / Direct3D 10+). Note that one enumeration lists the components in reverse order of the other, and that this is correct! https://docs.microsoft.com/en-us/windows/win32/direct3d10/d3... .
Documentation often completely omits endianness or encoding information - and when it doesn't, it's often hidden away where you'll never find it, and it usually assumes x86/x64/little-endian processors. The behavior on big-endian machines is best found out through testing - even if the documentation is clear, it stands a good chance of lying, CI probably doesn't test on big-endian (meaning bugs have likely arisen), and there's a good chance your copy of the library is old and doesn't contain any bugfixes.
In light of all of the above, RGB vs BGR confusion is one of the most natural points of confusion to run across when dealing with image formats. "Just use the type system!" ignores where these bugs crop up: (de)serialization, converting streams of bytes to types or vice versa. Someone must write the code, declaration, whatever - and the type system has no means of ensuring that it correctly matches whatever not-computer-readable spec the (de)serialization is supposed to follow - and so it will, frequently, be understandably incorrect.
Ok, but bring it back to this article. It is marshaling typed struct fields to typed parameters, but because the types are all primitive, the author entered them backwards and the language platform didn’t notice. The type system can prevent this flaw.
The author originally suspected - or hypothesized - that this was the bug, but the actual cause detailed in the post ended up being different (nondeterministic hashmap iteration over components), and the final patch does not include the initially proposed "fix" of swapping the red and blue components shown near the top of the article. The original order was correct. Swapping red/blue would've just been canceling out one bug by adding another.
The type system can add some roadbumps to encourage centralizing the conversion of "untyped" disk bytes to typed r/g/b components or a typed color component, but that conversion must still occur somewhere, and it can be written backwards at that somewhere, and one of those somewheres will be inside JPGLoader.cpp. That's the fundamental job of a JPEG loader - conversion of an untyped byte stream into useful types for the rest of the program to use. Additional use of the type system might limit where within JPGLoader.cpp such a component swapping flaw might be likely, but it's never going to prevent it for all of JPGLoader.cpp.
The existence of the Color type already makes such a flaw fairly unlikely outside of JPGLoader.cpp and similar conversion points. That's sufficient use of the type system. More within JPGLoader.cpp, while possible, would be overkill based on my own experience with color conversion and swapped-component bugs. Hell, to play devil's advocate, further use of the type system could cause such flaws! The extra code is likely to cause additional reviewer fatigue in code reviews. Reviewer fatigue causes inattention. Inattention causes the reviewer to miss swapped or nondeterministic components. Perhaps such inattention caused this bug!
I would be in favor of replacing:
const Color color { (u8)block.y[pixel_index], (u8)block.cb[pixel_index], (u8)block.cr[pixel_index] };
with:
const auto color = Color::from_rgb((u8)block.r[pixel_index], (u8)block.g[pixel_index], (u8)block.b[pixel_index]);
This makes it clearer that the code is correct as-is. It doesn't add any new types, though - it just names a constructor so as to imply an argument order, and uses the existing r/g/b union aliases of y/cb/cr in macroblocks (r/g/b is more correct at that point in the code, since ycbcr_to_rgb should already have been called - it's possible this code was written before those union aliases were added?)
It's still possible to have a copy+paste bug with my replacement by using r/g/g as input instead of r/g/b or similar, but I've seen such bugs even with per-component types - just need the first copy to be mistaken and copy+pasted all over.
I'll also note: one of the recommended ways to avoid bugs where you type out rgg instead of rgb is to use loops and index by component instead of manually naming the components. Perhaps:
Color color;
for (size_t c=0; c<3; ++c) color[c] = (u8)block[c][pixel_index];
However, this becomes impossible in many programming languages if you give every individual component a unique type!
Also, even if we model the color space in some overly clever way, we are talking about a JPEG loader here. Its input is inevitably a bag of numbers, and a horribly underspecified one.
I felt like I forgot to add a link. Why we don't see images in the wrong color space nearly as often as we see text in the wrong character encoding has remained a mystery to me since the moment I was writing a toy JPEG decoder and found this article:
Probably, a large part of the answer is that you can walk away from these tables in horror and simply assume YCbCr or grayscale, and still show most files in the wild correctly. Serenity seems to do just that. For text, UTF-8 is still not quite as dominant.
See, you've confused age with maturity. A library that is old and still broken and hard to use isn't mature, it's just old. A mature library would be one that has learned something from the history of software development.
foo(5, 42, 99) is just a bad interface. foo(red(5), green(42), blue(99)) gives the compiler enough to whine about.
Parsers, decoders and decompressors consume streams of bytes. The bytes from these streams come with no type info. Type safety happens on higher levels, once these byte streams are already parsed into something else, at least partially so.
In Rust this is a common pattern because single-element structs are unwrapped by the compiler, but only after typechecking. So they can be easily leveraged to prevent type confusion between differently typed values with compatible underlying byte representation.
// single field tuple structs
struct R(u8);
struct G(u8);
struct B(u8);
// using a tuple struct
struct Color(R, G, B);
// or using a struct with named fields
struct Color {
r: R,
g: G,
b: B,
}
I assume something similar can be done in C with no loss of performance under most compilers, since unwrapping a single-field struct is an optimization that a compiler should pick up on every time.
Yes, that will prevent many errors, but code like this converts bytes to colors, and there, you might accidentally write (example may or may not be valid rust)
wouldn't be valid in Rust. A typedef in C or C++ (or in Rust, which also has them) defines a new name for a type, but it's still type compatible, so you could do something like this. But an 'R' above is not a typedef of u8, it's a separate type, and it is not type compatible. You could not construct a Color by passing a[0] as an argument. You would first have to wrap a[2] as an R, a[1] as a G, a[0] as a B, at which point you would have another opportunity to notice an issue, and you would receive a type error if you tried to call
let r: R = R(a[2]);
let g: G = G(a[1]);
let b: B = B(a[0]);
Color(b, g, r);
> Also, you'll need types for every single color space. Do you mean sRGB, Adobe RGB, scRGB, etc.? Since that set isn't fixed (every device potentially has its own RGB gamut), you'll need a programming language that can create types at run time if you want to write a program that dynamically handles all kinds of devices.
Types in Rust don't exist at runtime, they're a compile-time concept, so creating types at runtime doesn't make sense. This sort of thing is generally approached by having the compiler monomorphize generics in places where various representations overlap, and otherwise just having a separate set of types for each non-compatible representation. The correct library would then be picked dynamically from the set of available options compiled into the binary.
> I assume something similar can be done in C with no loss of performance under most compilers, since unwrapping a single-field struct is an optimization that a compiler should pick up on every time.
For external functions, it's not an optimization but an ABI requirement (for at least some ABIs) and even for TU-internal functions compilers tend to stick closely to the ABI.
Not the GP, but I'm guessing it has to do with creating distinct R, G and B types and then building Color out of them. This will still have human failure points when instances of these types need to be initialized from the raw data. It also won't be succinct, that's for sure.
Last time I looked a few years ago there was some issue with class instantiation and the order of initialization that was holding them back from implementing designated initializers.
(If you still manage to get the ordering wrong even when you're forced to write the name of this macro... then programming is probably not for you.)
Edit: I just noticed the return type is "void". This error is entirely due to the destruction of MSDN; my copy of WIN32.HLP has the correct definition.
> If you still manage to get the ordering wrong even when you're forced to write the name of this macro... then programming is probably not for you
Humility is key to prevent disasters, and ordering will be gotten wrong as long as there is no type checking and we are relying on people to double check the ordering. A mistake can happen because of a temporary lack of attention, tiredness, confusion, distraction, interruption… I hope any experienced programmer is aware of this, or maybe they should avoid programming?
I don't think it solves the problem, because you can still swap 1 and 2.
Ideally, 1 and 2 are of different types (which may not always be practical).
It could make it easier to spot an error if 1 and 2 are in fact variables that have a name (RGB(R(green), G(red), B(blue)) screams HEY, SOMETHING IS WRONG HERE).
Why this code was written this way instead of just checking against the ID by linearly iterating over the Components, I have no idea.
There's another comment here that shows their JPEG decoder only supports 1 or 3 components. Using a hash table for at most 3 entries is astronomical overkill! The field for the number of components in a JPEG file is only a byte - so even if you were to support more than 3 components, it would be 255 at most, and sane JPEG files are unlikely to have more than 4 in practice. So fixed-size arrays are the perfect solution. I have a very good idea why it was written like that, however: the mere availability and ease of use of abstract data structures naturally leads to their overuse by what seems to be an overwhelming majority of developers, and the eagerness to "futureproof" and add unnecessary complexity is also very common.
Overcomplexity creates bugs, and (leaky) abstractions hide them. Even if the complexity (and bug) isn't visible in your code, it's still there. If the initial code had been reviewed by me, it most certainly would not pass.
I am saying this as someone who has actually written a JPEG decoder --- it is not as difficult as it may sound, and I highly recommend giving it a try --- and also witnessed (and fixed) the messes that people get in with overcomplicating things. Perhaps that slight "architecture astronaut" mentality comes from starting with a higher-level language; I started with Asm and my preferred language is usually C, and I haven't seen the tendency to overcomplicate much in others who did the same.
The reason the Components were in a hash table was that they would then be checked against the component ordering in a “Start of Scan” section to make sure all the components in the SOS section are in the expected order.
For comparison, this is the corresponding code in my JPEG decoder. I didn't even bother using a loop: