JVM Anatomy Quark #26: Identity Hash Code (shipilev.net)
101 points by fniephaus on July 20, 2021 | 65 comments



I don't see how

> "Not suprisingly, [...] global counters [...] are poorly distributed as well."

is true.

I mean, yes, they are not uniformly distributed, but that was never the requirement. As the article itself states, the desired property is that "the values for distinct objects are more or less distinct". With a global counter, you get maximally distinct hash codes. More distinct than any of the other approaches (and not less than any user-implemented function), at least until 2^31 object allocations.

Yes, after 2^31 objects you will get repeated values, but that is trivially the case for any pattern of assigning 31 bit hash codes to distinct objects (and any of the pseudo-random approaches will get birthday-paradoxed much sooner, and much harder). The only case where this could matter is in an adversarial scenario where someone is trying to DoS your hash map with crafted inputs. But according to the article itself, it would take 4+ minutes (120 ns * 2^31) of only allocating objects for each global counter wrapping. If an adversary can reliably achieve that already, what's the point in slowing down a hash map by an epsilon every four minutes?
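For a back-of-the-envelope check of that figure (taking the ~120 ns per-object cost from the comment above as given, this is just arithmetic):

  public class CounterWrapTime {
      public static void main(String[] args) {
          long nanos = 120L * (1L << 31);                          // 120 ns per object, 2^31 objects
          System.out.printf("%.1f minutes%n", nanos / 1e9 / 60);   // prints ~4.3 minutes
      }
  }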


> the values for distinct objects are more or less distinct

I think these words of the author understate the requirement for a good distribution of hash codes. As far as I understand, ideally the hash codes for different objects should be as distinct as practically possible, so that they often end up in separate buckets of a hash table.

Consecutively allocated objects will have almost all bits of their addresses equal.


One neat trick could be to use a linear congruential generator to iterate through the 32-bit integers without repetition, as long as you choose the right parameters. Fortunately, these are relatively simple to pick [1] when we're dealing with powers of two.

This would give you fairly easy-to-generate IDs which still have a period of 2^32 but where subsequent allocations have an ID that shares fewer bits on average with its predecessor.

[1] https://en.wikipedia.org/wiki/Linear_congruential_generator#...
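A minimal Java sketch of such a full-period LCG (the multiplier and increment below are the well-known Numerical Recipes constants, which satisfy the Hull-Dobell conditions for modulus 2^32; purely illustrative, not what any JVM does):

  public final class LcgIdGenerator {
      private static final int A = 1664525;       // a = 1 (mod 4)
      private static final int C = 1013904223;    // c is odd
      private int state = 0;

      // Visits every 32-bit value exactly once before repeating (period 2^32);
      // int arithmetic wraps modulo 2^32, which is the implicit modulus.
      public synchronized int nextId() {
          state = A * state + C;
          return state;
      }
  }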


I think that’s similar to what the article describes.


The article didn't specifically discuss the RNG period, but I see the linked code [1] uses an implementation [2] with a period of 2^128 - 1.

[1] https://github.com/openjdk/jdk/blob/4927ee426aedbeea0f4119ba...

[2] https://en.wikipedia.org/wiki/Xorshift
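For reference, a hedged Java sketch of Marsaglia's xor128 generator, the xorshift variant with period 2^128 - 1; the seeds and shift constants are the textbook values from Marsaglia's paper, not necessarily what HotSpot seeds per thread:

  public final class XorShift128 {
      // 128 bits of state; must not be all zero.
      private int x = 123456789, y = 362436069, z = 521288629, w = 88675123;

      public int next() {
          int t = x ^ (x << 11);
          x = y; y = z; z = w;
          w = w ^ (w >>> 19) ^ (t ^ (t >>> 8));
          return w;
      }
  }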


I wouldn't trust any hash table where a hash function yielding consecutive integers would inherently lead to bad behavior. Can you tell me a popular hash-to-bucket reduction that performs badly for this case?

I will concede that there are plenty of schemes where insufficient entropy in lower bits causes problems. Combining those with the global counter hash and e.g. only inserting every 64th allocated object could be a failure case indeed. But this is still simple to defend against in the reduction scheme.


I think it's about the naive "rep_array[hashcode(obj) % bucket_size] = object" pattern, which is all too common.

If your rep_array is doing linear probing and/or robin-hood hashing, then incremental hashcodes (such as 1, 2, 3, 4, 5...) are a bad thing. Especially if you're doing both inserts and removals: this sort of incremental pattern would lead to many "runs" where linear probing performs poorly.

Of course, it isn't very hard to do rep_array[(hashcode(obj) * large_constant_odd_number) % bucket_size] instead and get a good distribution. But the question is whether or not people know about those kinds of steps.
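A sketch of that mixing step, assuming a power-of-two bucket count; the constant is the 32-bit golden-ratio multiplier often used for Fibonacci hashing (my pick for illustration). Taking the high bits of the product, rather than masking the low bits, also covers the "every 64th object" pattern mentioned upthread, whose low bits carry no entropy:

  final class BucketMixer {
      // Assumes 1 <= log2BucketCount <= 31.
      static int bucketIndex(int hashCode, int log2BucketCount) {
          int mixed = hashCode * 0x9E3779B1;        // large odd constant
          return mixed >>> (32 - log2BucketCount);  // high bits of the product as the index
      }
  }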


> The only case where this could matter is in an adversarial scenario where someone is trying to DoS your hash map with crafted inputs.

That happens in practice.


Does it also happen in practice when the crafted inputs are given at minimum 4 minutes apart, on the precondition that those minutes are filled with allocation spinning finely controlled by the attacker?


People are obsessed with that siphash paper, and they're obsessed that Java hashcodes do not do the siphash mitigation, but I don't know if any of it matters. People are obsessed with thinking it matters though.


I think it was a mistake to put hashCode (and to an extent, equals) on all objects by default. It makes objects bloated, and because of that, even classes whose instances will never be put into a hash table end up providing their own hashCode implementations, causing problems like the one mentioned in the article. There should have been a separate, dedicated "Hashable" interface. Strangely, C# shares the same caveat. Is there any benefit to making objects hashable by default?


I don't know about bloat per se, but I think putting `equals` on all objects is a design flaw. This implicitly assumes that every object has a single implementation that works for every case, which is frequently false. Some objects have no semantically meaningful concept of "equals". Others have situation-specific definitions.

A better design (which C# has made some steps towards) is defining interfaces for Equatable and Hashable, and requiring the object to implement methods that return them. Then they can return them or not, or have multiple implementations of each if needed. Or users of the objects can define their own, custom implementations easily.


The vast majority do, though. This was a design decision to be useful 85% of the time and allow for graceful degradation in the edge cases (you can always just return false from equals()).


> Equatable and Hashable ... have multiple implementations of each if needed

I don't believe this would fix the problem, because the reason you'd want multiple implementations is that you'd want to use different ones under different circumstances that may only be determined at runtime anyway.

I much prefer the Haskell way of thinking about typeclasses, but this mechanism is fairly difficult to implement in Java... maybe object algebras can potentially be used? See this paper: https://docs.google.com/viewerng/viewer?url=www.cs.utexas.ed...


It's definitely convenient and served Java well in the sense that it always was a fairly easy language to get started with. Other languages make similar choices for the same reason; it trades off good enough against perfect. It wasn't a mistake but a deliberate choice. Also important to realize that this choice was made in the early nineties, 30 years ago.

But you are right that it's not free. Recent and upcoming changes to Java and the JVM are providing solutions in the form of e.g. value classes, records, etc.

Additionally, hotspot is doing lots of clever stuff under the hood where it makes sense. Finally, if you know what you are doing, Java provides plenty of ways to optimize things.


Moreover the Hashable interface really needs to take a `HashState` object that can be given blocks of memory regions. Then you could plug in arbitrary algorithms rather than having each object hard-coded to use some half-baked implementation that's usually wrong & definitely wrong for anything that needs cryptography/DOS protection in the hash table. The reason is that Hash(m1 + m2 + m3) is what you want as it's computationally optimal & lets you plug in crypto hashing algorithms, but the old technique that Java bakes into the language forces you to write Hash(Hash(m1) + Hash(m2) + Hash(m3)).

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n398... which I'm sad hasn't been adopted by C++ yet. It looks like Rust has taken note of this paper.
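A hedged sketch of what such an interface could look like in Java (hypothetical types, not a real JDK or proposed API): the object feeds its fields into a caller-supplied hash state, so the hash table picks the algorithm (FNV, keyed SipHash, ...) and the whole message is hashed in one pass instead of combining per-field hashes:

  interface HashState {
      void update(long value);
      int finish();
  }

  interface Hashable {
      void hashInto(HashState state);
  }

  final class LineSegment implements Hashable {
      final long x1, y1, x2, y2;
      LineSegment(long x1, long y1, long x2, long y2) {
          this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
      }

      @Override
      public void hashInto(HashState state) {
          // All fields go into one running state: Hash(m1 + m2 + m3 + m4).
          state.update(x1); state.update(y1);
          state.update(x2); state.update(y2);
      }
  }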


Abseil uses this form of hashing: https://abseil.io/docs/cpp/guides/hash


Given the algorithm used, I don't really understand what bloat you are referring to. Compared to assembly? I guess maybe. But this is a managed language with near-C++ performance. Small price to pay.

Also, there's utility for hashcode/equals far beyond the collections framework... having an identity in a toString() for instance is of great value.


Yes, seems it was just for hash tables. From the JDK 1.0 doc:

"""

Returns a hash code value for the object. This method is supported for the benefit of hashtables such as those provided by <code>java.util.Hashtable</code>.

The general contract of <code>hashCode</code> is:

Whenever it is invoked on the same object more than once during an execution of a Java application, the <code>hashCode</code> method must consistently return the same integer. This integer need not remain consistent from one execution of an application to another execution of the same application.

If two objects are equal according to the <code>equals</code> method, then calling the <code>hashCode</code> method on each of the two objects must produce the same integer result.

"""

hashCode is also related to object comparison (see the last line) and this was 2 Java releases before `Comparable` existed.
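A minimal example of that contract (illustrative class, not from the JDK): equal objects must return the same hashCode, and repeated calls on the same unchanged object must return the same value.

  final class Money {
      final String currency;
      final long cents;
      Money(String currency, long cents) { this.currency = currency; this.cents = cents; }

      @Override public boolean equals(Object o) {
          if (this == o) return true;
          if (!(o instanceof Money)) return false;
          Money m = (Money) o;
          return cents == m.cents && currency.equals(m.currency);
      }

      @Override public int hashCode() {
          // Derived from exactly the fields equals() compares, so equal objects hash alike.
          return 31 * currency.hashCode() + Long.hashCode(cents);
      }
  }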


I'm not sure what you mean by bloat. Adding methods doesn't really affect individual object size, just class size. And actual code weight rarely impacts anything in practice.

In the case of Java/C#, hashcode and equals are optional anyways. So if you don't override them, there's no extra space consumed.


The storage of the identity hash code takes ~4 bytes in the object header, and that is independent of whether the method is overridden or not. It's explained in the article.


But assuming that the other bit in those 4 bytes is required and there isn't some other place available where you could sneak it in, those ~4 bytes would still be allocated (for that one other bit) and only take less memory under some compression scheme.

And chances are that even if you did get by without that bit (or found another place for it), you'd in many cases spend far more memory on workarounds for classes whose creator was too stingy to implement "IHashable" when you actually need it.


Well, in Virgil, a class without a superclass is a new root class, with no assumed methods. To use an object of that class as a key in a hashtable, one provides a hash function to the hashtable constructor. (The hash function can be a class method, or a top-level function; with generics, a hashtable key could as well be any type, including primitives). Of course it depends on how you write your code, but most of my objects do not end up as keys in a hashtable, so they don't need a hash function nor storage for an identity hashcode.

So the smallest object header for a Virgil object is 4 bytes: it's just a type ID that is used for dynamic casts and as an index into a table for GC scanning. Arrays have two 4 byte header fields: the type ID and a 32-bit length. That's considerably more memory efficient than 2 or 3 64-bit header words for Java objects and arrays, respectively.


You generally need to reserve at least a pointer-sized block of memory for a Java object because they can be used as locks with the synchronized keyword. Very few objects ever get used as a lock, so they make use of the bytes for other things like the hash.

Of course, that does imply the synchronized keyword has a fairly substantial memory cost.


Agree. Everything built-in to every object is a recipe for overhead.


They don't mean code weight - they mean field weight.


I feel that if your application is so sensitive to this level of "weight", you may be better off either using something lower level (like C?), or using direct byte array buffers and encoding your data that way rather than as objects.

Not very many applications would have this sort of requirement tbh. Real-time applications, and perhaps, huge scale data applications.


I agree. The built-in equals/hashCode is never what I want to use. Every time the built-in version was used in my code, it was because of a mistake and had to be fixed by providing a proper implementation. This functionality is a source of subtle bugs.


I believe C# shares a few caveats with Java because they wanted to make it easy to port Java codebases to it. Making objects hashable by default and having arrays be covariant, despite early plans to support generics, both aid in that.


It doesn't cost as much as you might think. The word it's stored in is overloaded with the object lock and garbage collection bits.


Where is the bloat? There are default behaviours for equals and hashCode, so you never need to define them if they're not useful for your Object.


Did you read the article? The JVM reserves ~4 bytes in the object header of every object to store the identity hash code, regardless of whether you override .hashCode() or not.


Compared to all the other bloat in Java, this is pennies


If you have an object that has only one field, it still consumes at least 3 words of memory: 2 for the header and 1 for the field. That's a 200% memory overhead. Of course that's extreme, but I've seen estimates that as much as 40% of the heap in large Java applications is just object headers. You probably need at least 1 header word in any scheme, but 2 is wasteful IMHO.


There's a way to save memory by sacrificing .hashCode() runtime performance (on the assumption that most objects never get hashed) by keeping a separate lookup table for hashcodes, at the cost of having to do the indirect lookup. I don't recall which VM used that technique. There's also fun tricks like rewriting/optimizing the object header during a copying/compacting garbage collection.


Yeah, I know that trick. I implemented an identity hashmap[1] inside of V8 that uses that, because JS objects don't have an identity hash code[2]. It uses the object address as the basis of the hash code, which means the table needs to be reorganized when the GC moves objects, which is done with a post-GC hook. The identity maps in V8 are generally short-lived.

I am not aware of production JVMs that use this, because in the worst case they can use a lot more memory if a lot of objects end up needing hashcodes.

[1] https://github.com/v8/v8/blob/dc712da548c7fb433caed56af9a021... [2] I guess they might have an invisible one now; I haven't looked at the implementation of WeakHashMap/Set.


What objects do you have lots of that you're not storing in a HashMap or similar?


They use an int instead of a long, which means that if you start to group objects together in any magnitude you will get collisions fairly quickly.

Any large HashTable in Java starts to yield the problem of duplicate keys. It's just a weird situation: you can 99.999% trust something... but can't ever fully trust it, so over time you're guaranteed to have something wrong.

This problem exhibits itself again under the hood in the JNI API when you have to identify objects in another domain i.e. C++.

It's not a 'Quirk'; it's basically a big mistake.

The ability to uniquely identify objects is so important in so many ways.

Sometimes I wish every few versions of Java they would skip the reverse compatibility and make some needed changes.


Hashcode is not used for identity and is explicitly not intended for this and never was. The whole point of this article is that the modern default implementation does not even rely on object identity anymore (it used to). Java has object identity, of course. Most good equals implementations rely on that to figure out whether the object is being compared to itself before running a more expensive field-by-field comparison.

Collisions are actually a good thing if you are implementing a hash table. Otherwise you end up with one bucket per object, which does not scale very well. The reason hashcode can be overridden is so you can have some control over these collisions if you need to.


It should be. If every object had a 64-bit id that was guaranteed to be unique, it would make so many things so much easier and more practical. That we're 25 years into this and there is still ambiguity is not helpful. I'll bet they could do that and still have nice entropy in the ids and good hashtable performance.


>Any large HashTable in Java starts to yield the problem of duplicate keys. It's just a weird situation: you can 99.999% trust something... but can't ever fully trust it, so over time you're guaranteed to have something wrong.

hashCode() is a prehash function whose outputs need to be mapped further onto the (typically much smaller) number of buckets in a hash table of a given size (which depends on the number of objects currently in the table). Those "duplicate keys" are not a problem; they're how hash tables work in any language. Objects' hashcodes are used to find the relevant bucket, then that bucket is properly examined using equals(). HashMap and Hashtable are backed by arrays, which have a max size of Integer.MAX_VALUE (minus some change) in the JVM anyway, so they would need to be indexed by an int. I hope this helps to overcome the trust issues you have with Java data structures.
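A simplified paraphrase of how OpenJDK's HashMap turns a hashCode() into a bucket index (see java.util.HashMap.hash(); this is not the verbatim source): the table length is a power of two, the high bits are folded down, and the result is masked; collisions within a bucket are then resolved with equals():

  final class Buckets {
      static int bucketFor(Object key, int tableLength) {  // tableLength is a power of two
          int h = key.hashCode();
          h ^= (h >>> 16);                 // spread the high bits into the low bits
          return h & (tableLength - 1);    // bucket index
      }
  }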


I think I miscommunicated.

I understand hashtables effectively work from 'hashes' which imply collisions etc..

I'm so used to using the term 'hashtable' that I forgot it implies a specific implementation. I should have used the term 'Map' or 'key/value table'; I'm resigned to having used the terms interchangeably too often.

The notion of 'hashes' which can produce 'collisions' creates a bunch of unnecessary concerns and complications given the ultimate objective of a hashtable, i.e. as a key-value store.

If every object had a guaranteed unique global id, which we could use as a key, then this would provide a lot of clarity and avoid problems. Of course the word 'hash' doesn't really even belong in the context of the higher level abstraction of key-value store as it's implementation specific.

Unfortunately, Java uses the word 'identity' in System.identityHashCode, which is really confusing. It's not really an 'identity'. It's misleading and I bet tons of Java devs are unaware (or forget). There's actually a bug filed on it [1].

A few years ago I had to spend a day down this rabbit hole, as many devs have to, and it's just unnecessary. Better naming, I think, would really help.

[1] https://bugs.openjdk.java.net/browse/JDK-6321873


Love to read about these implementation details!


There is a caveat the article doesn't mention: it's not possible to use both synchronized/lock elision and System.identityHashCode. The 4 bytes in the object header are used either for the hashCode or for the thread ID (not java.lang.Thread.getId()) of the thread holding the monitor, plus 2 control bits.

Also, this stuff has not changed for the past... 14 years or so. It's pretty old news.
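For reference, this is (from memory, so treat the exact bit counts as approximate) how the layout comments in HotSpot's markOop.hpp describe the 64-bit mark word before biased locking was removed:

  // unused:25 | hash:31        | unused:1 | age:4 | biased_lock:1 | lock:2   (normal object)
  // JavaThread*:54   | epoch:2 | unused:1 | age:4 | biased_lock:1 | lock:2   (biased object)
  //
  // The identity hash and the bias owner's thread pointer overlap, which is why
  // an object that has revealed its identity hash can no longer be biased.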



Can you clarify what is not possible or provide a link please?

I haven't heard of this, and don't understand what you said well enough to google it myself.


That's an article[0] from Dave Dice (a quick search for "biased lock java hashCode header").

Edit: Found a copy[1] of Cliff Click's early blog... For whatever reason Google doesn't have it listed (so I typed the URL); cliffc.org is empty as well. Other than that, Java's HotSpot source is available too.

[0]: https://blogs.oracle.com/dave/biased-locking-in-hotspot

[1]: https://www.h2o.ai/blog/biased-locking/


Thanks!

So am I correctly understanding this as:

Using identityHashCode() does not prevent locks from working, it merely prevents the performance optimization of biased locking (which Java 15 disables by default anyway)?


Something like that. Biased locking is overall a software fix for a hardware shortcoming, notably CAS being slow (and the need for a fence on the read part, but that's pretty much free on most hardware nowadays).

Lack of biased locking could be quite detrimental to CHM (ConcurrentHashMap), given how widely it's used. OTOH it'd mean a smaller header/memory footprint, which is nice.

If you are interested in Java internals, the insights of Cliff Click (the original HotSpot architect) from a decade ago will be worth your while.


"It is a frequent source of bugs to change the object in such a way that its hashCode changes after it was used"

(citation needed)

I have never seen that. 20 years in the game.

I can't remember ever having seen anyone use a mutable object for a hashmap key.


It's all private code, so I can't give you a citation, but I've often helped junior devs with bugs due to this.

The most common way to get there starts with creating a "bean" object. And then, since it's a bean, it has mutable getters and setters. On everything, because that's just the standard way that people are (or at least were) taught to do these things in Java. And then you ask your IDE to give you equals and hashCode implementations, and yeah sure whatever just use all the fields. And then, later on, you need some sort of lookup table that cross-references the data modeled by this bean with some outside source of information. So of course that's a HashMap. And then you go through your objects and update them all based on that information. Maybe you just edit the description or the lastUsed field, who knows. All sorts of possibilities here. In any case, there's your bug.


I've seen it as well.

Usually it takes the form of "I added object x to a HashSet, then while processing things I update a field in object x used in the hashcode. Object x no longer appears in the set".
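A minimal illustration of that failure mode (hypothetical Key class, not from any codebase mentioned here):

  import java.util.HashSet;
  import java.util.Objects;
  import java.util.Set;

  class Key {
      String name;
      Key(String name) { this.name = name; }
      @Override public boolean equals(Object o) {
          return o instanceof Key && Objects.equals(name, ((Key) o).name);
      }
      @Override public int hashCode() { return Objects.hash(name); }
  }

  public class MutableKeyBug {
      public static void main(String[] args) {
          Set<Key> set = new HashSet<>();
          Key k = new Key("a");
          set.add(k);
          k.name = "b";                          // hashCode() changes after insertion
          System.out.println(set.contains(k));   // false: lookup probes the wrong bucket
      }
  }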


>I have never seen that. 20 years in the game.

I have seen that a few too many times; it's pretty terrible. People debug and print HashMap contents, and complain the thing is broken... as get(Object) just returns null after finding an empty bucket.

I guess you have been a bit luckier than I was. Definitely seen too many mutable keys.


> I can't remember ever having seen anyone use a mutable object for a hashmap key.

HashMap keys aren't the only issue, sadly. In my experience, this is way more common with HashSets when trying to reduce a collection to a distinct set of items. If any of those items is shared and modified in another thread, you're in for a wonderful time debugging.

Mutability makes things so much harder to reason about. It'd be great if there were a language-assisted way to enforce immutability for collection members.


HashSets are using a HashMap behind the scenes, they put a "present" placeholder value in the map when adding an element to the set: https://github.com/openjdk-mirror/jdk7u-jdk/blob/f4d80957e89...
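Paraphrased (and heavily simplified, not the real class) to show the trick:

  import java.util.HashMap;

  final class TinyHashSet<E> {
      private static final Object PRESENT = new Object();   // shared dummy value
      private final HashMap<E, Object> map = new HashMap<>();

      boolean add(E e)      { return map.put(e, PRESENT) == null; }  // element is the key
      boolean contains(E e) { return map.containsKey(e); }
      boolean remove(E e)   { return map.remove(e) == PRESENT; }
  }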


I'm not sure I understand your point. Yes, HashSet uses HashMap internally. But that's an implementation detail. I could have a Set<Foo> without knowing what the implementation is. If it's a TreeSet, things will work fine (as those use Comparators internally, and those don't usually rely on hashcode). If it's a HashSet, you can be boned without ever knowing why.

IMO people are aware of the "mutable hashcode" issue when using maps, because you have to think about what the key of a map is when you're defining it, which is a good opportunity to remember that Foo is mutable. When you're using a Set, there's no key, only a value, which can sometimes be used as a key as an implementation detail, but there's no mental trigger.


You mentioned HashSets as another, distinct source of problems; I wanted to highlight that these two are the same thing behind the scenes, as the backing map's key set is doing the heavy lifting. It definitely is an implementation detail, and I can't say that all JDK variants employ the same trick, but a fair few of them do, given that most of them share the same set of java.lang classes.


Just an addendum, sorry for the double post: I checked two additional implementations in the meantime.

Apache Harmony: https://github.com/apache/harmony-classlib/blob/ee663b95e840...

GNU Classpath: https://github.com/ingelabs/classpath/blob/a96698f67357fddbe...


HashSet is just a HashMap (it has one as a field) and that's pretty much it; it's quite inefficient (both in performance and memory footprint), with HashMap being a bit inefficient to begin with.


Same here. Mostly, the reason is that static code analysis tools are pretty good at spotting dodgy equals and hash code implementations. Also people have been using libraries and ide support to help them create good implementations or things like Lombok that generate good implementations.

These days with Kotlin data classes or Java's records or value classes, there should rarely be a need to roll your own hashcode and equals implementations.

Finally, mutability is a lot less popular these days. I feel kind of dirty every time I have to use an actual var in Kotlin. Having everything immutable by default makes mutability an opt in choice. Kotlin actually uses compiler plugins to deal with frameworks that need data classes to be mutable (like hibernate) just so you can pretend they are immutable while programming. So, accidental mutation is not a thing.
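For example, a Java record derives equals(), hashCode(), and toString() from its components, so there is nothing hand-written to get wrong:

  record Version(int major, int minor) {}

  class RecordDemo {
      public static void main(String[] args) {
          System.out.println(new Version(1, 2).equals(new Version(1, 2)));                   // true
          System.out.println(new Version(1, 2).hashCode() == new Version(1, 2).hashCode());  // true
      }
  }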


I have seen it happen many times. A lot of developers I've worked with do not understand the basics of a HashMap/HashSet. They just use the IDE to automatically create hashCode functions with mutable attributes or use classes created by other people and do not check if the hashCode implementation is correct.


Just to add another anecdotal data point: me neither. Neither in java, nor in .Net, and surely not in C++ (there were lots of other things to mess up), having around ~15 years of experience under my belt. To me it is quite elementary, almost muscle memory, to never mutate objects used as keys - and while I observed quite a few rookie mistakes (some of them from the "I" perspective), it'd never occur to me to change a key in a map.

I'd probably learn a lot from stories explaining the environment of such situations, so if you have one, please share.


We see that surprisingly rarely because the completely clueless tend to get saved by identity hashcode/equals. (and the lazy, and often even the pragmatic)

Whether or not you get to see code falling into the mutable hash trap probably depends a lot on culture and the hiring process/environment. Do people feel empowered to do the right thing? Or free to take whatever easy path that works for them? Or are they pressured/incentivized to aim for some superficial appearance metrics?


Maybe you only work on your own code more often than not but I've had discussions during code reviews with junior developers on why this is bad practice.


To be honest, I can't bring to mind any code that changes a value without re-calling put(); code that does that would smell so bad you would spot it by the @author.



