For .NET devs: Always use something that implements IList instead of a primitive array. As Eric points out, you can make it read-only and avoid the malloc frenzy that occurs under the hood when you work with methods that return primitive arrays.
You can also inherit from any of the IList / IEnumerable / ICollection types and make customizations to better support parallelism, memory usage, etc... I've done implementations on top of IQueue (a cousin of IList) before to support custom Reader/Writer scenarios for a parallelized polling service (in-memory job queue) and it worked well.
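For example, a minimal sketch of the wrapping idea in C# (TweetStore and GetTweets are illustrative names, not from the article):

    using System;
    using System.Collections.Generic;

    class TweetStore
    {
        private readonly string[] tweets = { "first", "second", "third" };

        // Hand out a read-only view instead of the array itself.
        // Array.AsReadOnly wraps the existing storage in a ReadOnlyCollection<string>,
        // so no copy is made and callers can't cast the result back to string[] to mutate it.
        public IList<string> GetTweets()
        {
            return Array.AsReadOnly(tweets);
        }
    }

The caller sees an IList<string> whose IsReadOnly is true; any attempt to set an element throws NotSupportedException.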
Arrays are a perfectly good tool when you need a fast, random access, mutable data structure with dense indices.
When any of those qualifiers is not required, some form of indirection-based data structure will probably offer an advantage: persistence, more flexible index types, more efficient access patterns, or a lower risk of incorrect usage.
In practice, most situations do not require all of those qualifiers. However, the culture around many programming languages still promotes arrays as the default choice, rather than encouraging developers to think about alternatives and choose the most appropriate one in context. It's not arrays that are harmful, it's using them when they're the wrong tool for the job.
> What data structure has a more efficient access pattern?
That depends on what kind of access you need.
For example, if you are going to be doing a lot of searching to determine whether a value is an element of a set, an array will take O(n) in the number of elements to do a linear search, while various self-balancing tree structures will be O(log n), and a hash table with a good choice of hash function will be even better. Unless you are working with very small data sets, arrays will rapidly lose out here.
If we're talking about a data structure that gets modified over time, then suppose you need to store a sequence of values in order, and you will often insert or remove elements in the middle of the sequence. A linked list or similar structure can perform these operations for O(1) cost, but an array is O(n) in the number of elements.
Of course, all data structures are ultimately kept in one big array of memory/drive, so you could argue that the array can be just as good as the other data structures in the examples above. However, I think that is a circular argument: you're just reinventing the alternative data structures.
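To make those trade-offs concrete, here is a small C# sketch of the three usual membership checks (the data and the value searched for are arbitrary):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class MembershipDemo
    {
        static void Main()
        {
            int[] values = Enumerable.Range(0, 100000).ToArray();

            // O(n): linear scan over an (unsorted) array.
            bool foundLinear = Array.IndexOf(values, 42000) >= 0;

            // O(log n): binary search, valid only because this array happens to be sorted.
            bool foundBinary = Array.BinarySearch(values, 42000) >= 0;

            // Expected O(1): hash-based lookup, at the cost of building the set up front.
            var set = new HashSet<int>(values);
            bool foundHash = set.Contains(42000);

            Console.WriteLine("{0} {1} {2}", foundLinear, foundBinary, foundHash);
        }
    }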
> People default to arrays because 95% of the time they are either the best option or good enough that it's not worth worrying about.
Perhaps. Then again, perhaps this attitude of declaring the default, easy tool "good enough" instead of learning to use the right tool for the job is why in 2011 it still takes nearly a minute for my PC to boot in the morning and sometimes I still type faster than my state-of-the-art DTP software can manage to draw the letters on the screen.
You're making some major assumptions. Big O is great when dealing with large data sets but beware the K.
Deleting random elements from a linked list is an O(n) operation because you need to find the element(s). They also waste a fair amount of memory.
Determining whether a value is an element of a set is an O(log n) operation on an ordered array. And faster than a self-balancing tree because of cache locality issues.
Small Hash tables are slower than arrays because of that hash function. (I love Hash Tables but they make a poor default choice.) Edit: Set operations were not what I would call an "access pattern," but it's a valid point.
PS: Buffers and Bitmaps are both well suited to an array. You can't use a self-balancing tree in a hard real-time system. And so it goes, when arrays don't suit your needs there are great options, but they are less common than you might think.
> Big O is great when dealing with large data sets but beware the K.
Sure, you always have to be careful of the constant factor in reality. That's why I noted that arrays will typically lose out to alternatives like trees and hash tables unless you are working with very small numbers of elements.
Still, logarithmic beats linear very quickly as the amount of data increases. Consider that 2^10=1024, so by the time you have 1,000 or so elements to consider, you need a k that is about two orders of magnitude better for a linear scan to beat a binary tree.
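To put rough numbers on that 1,000-element case (the arithmetic is purely illustrative):

    using System;

    class ConstantFactor
    {
        static void Main()
        {
            // A linear scan of n items does ~n comparisons; a binary search does ~log2(n).
            double n = 1024;
            double linearSteps = n;                  // ~1024
            double binarySteps = Math.Log(n, 2);     // ~10
            // The linear scan only wins if each of its steps is more than
            // linearSteps / binarySteps times cheaper than a tree/binary-search step.
            Console.WriteLine(linearSteps / binarySteps);   // prints 102.4
        }
    }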
For hash tables, it mostly comes down to how long it takes to compute the hash function, which in general should be a constant factor itself.
> Deleting random elements from a linked list is an O(n) operation because you need to find the element(s).
Why would you want to delete a random element? You're moving the goalposts.
If you really did need that functionality, then a simple linked list wouldn't be the data structure you wanted. Use the right tool for the job instead.
> They also waste a fair amount of memory.
They "waste" a small number of pointer-sized chunks for each element. If you have so much data that this might matter, you're probably running on hardware with silly specifications so it doesn't and/or using a different approach to scaling anyway.
> Determining whether a value is an element of a set is an O(log n) operation on an ordered array.
Sure it is, after you've done your O(n log n) sort. You're moving the goalposts again.
> And faster than a self-balancing tree because of cache locality issues.
Cache locality is certainly a valid concern in some applications, particularly with recent generations of PC processors/memory architectures. Once again, you should pick the right tool for the job.
In this case, the basic problem is that fetch time dominates comparison time. This is something database engineers have been learning to deal with since forever, and block-aware data structures like B-trees have been known for 40 years.
> Small Hash tables are slower than arrays because of that hash function.
Are you sure? Consider this: if you wanted to sort an array of 1,000,000 numbers between 1 and 100 as fast as possible, how would you do it?
But yes, if you really are within the constant factor required for the hash function, then a hash table is probably not a good choice.
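The question is pointing at a counting sort, where the array itself serves as a perfect hash table; a sketch, assuming the values really are confined to 1..100:

    class CountingSortSketch
    {
        // Sort values known to lie in 1..100 by counting occurrences.
        // The counts array is effectively a hash table whose hash function is the value itself.
        public static int[] CountingSort(int[] input)
        {
            var counts = new int[101];            // indices 1..100 are used
            foreach (int v in input)
                counts[v]++;

            var result = new int[input.Length];
            int i = 0;
            for (int v = 1; v <= 100; v++)
                for (int c = 0; c < counts[v]; c++)
                    result[i++] = v;
            return result;
        }
    }

That's O(n) rather than O(n log n), and there's no hash function to pay for.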
> And so it goes, when arrays don't suit your needs there are great options, but they are less common than you might think.
Don't tell that to anyone who works with list-based programming languages!
As I said, there are plenty of cases where arrays suck.
However, 1,000,000-element arrays, while not small, are generally sorted so fast in memory that n log n is fine. I could write a custom hash function that stores the counts in an array ;), but it's not something I want to maintain, so the fallback for numbers outside the range would be a sort.
"Sure it is, after you've done your O(nlogn) sort." Creating a binary search tree is also an O(n log n) operation...
Binary search trees win when you want a constantly updated list that you also search, but that's a narrower case than just looking for elements in sets. So, for example: if you start with an unordered list and want to look up 50 elements to see which of them are in it, you are better off with a linear search through an array, unless you can sort the list while waiting for those 50 elements.
PS: I am bending over backward to support that 95%, but once you account for all the constants it is surprising how often an array works just as well as or better than the "optimal" data structure. Generally speaking you want large data sets in a database somewhere; locally you rarely deal with large data sets, and when you do it's often a temporary thing or a lookup table.
I like how "Arrays Considered Somewhat Harmful" becomes "Arrays are Harmful." Do away with those weasel words to make your title more catchy! (not link-baity, no no).
I don't understand the recent (from the last few years to a decade) intertubes hatred of "weasel words".
Sometimes* they are used to purposefully obfuscate or render a questionable argument virtually* unassailable.
Sometimes* they are used to express a complex reality that is not black and white. Most* real-world issues are shades of grey on top of shades of grey, mixed with some cause-effect loops, and turtles all the way down.
For computer science folks, "Considered Harmful" actually makes the title more catchy, not less. That phrase has a long history (relatively speaking) in the discipline.
Maybe you're being sarcastic/making a joke and I'm just not getting it?
Excellent point. It goes back to Dijkstra's "Go To Statement Considered Harmful."
I was just pointing out that changing the title from "... Considered Somewhat Harmful" to "Arrays are Harmful" is just linkbaiting the title. At the same time, it bastardized a play on the title of a pretty important paper in Computer Science.
At this point though I feel like we're really wasting keystrokes on something silly and off-topic.
Arrays are useful in many languages. I like C arrays quite a bit: they let you return a huge amount of data with a single allocation, and allocations are expensive.
As the submitter and not the author, I have nothing to gain by sharing the article. But I see your point. I was thinking of the part where he mentioned not having any problem that a fixed-size mutable collection models well; to me that translated to "arrays are harmful," and "somewhat" then seemed pointless.
> Arrays simply do not model any problem that I have at all well
Then don't use arrays. However, arrays (vectors and matrices) do model problems that scientists and engineers have, and nothing else models them as well or efficiently. So to call arrays harmful is to exclude a large and important class of applications simply because you don't happen to work on them. It would be more reasonable to note there is a need for immutable arrays not met by our current languages.
Is it an array that models your problem efficiently, or is it "an n-dimensional data structure with O(1) lookups by index", of which an array is merely one implementation?
Are there any other such data structures besides arrays and hash tables? (That's only half-rhetorical. I'd love to hear about one.) If so, are you seriously suggesting that when an array fits the bill, we should stop and think "hmm... maybe another data structure would be a better container for this dense, contiguous set of integer keys"?
Const != immutable. Const in C++ gives you a read-only view into an object. You can accomplish the same goal with more fine-grained control by using interfaces in C# (or Java, which also lacks const). Instead of returning a const reference to your object, you return your object cast to an interface that doesn't have any mutating methods.
What's an example of "more fine grained control"? I've used C++ more than Java or C#, but it's not apparent to me how removing a language feature gives you more fine grained control.
A const reference in C++ is like an interface that doesn't have any mutating methods, except that all the methods of the interface keep the same names and signatures as the original, and I don't have to type out a separate interface definition: I just choose which methods appear in both the immutable and mutable interfaces by appending "const" to their signatures (bonus: if the implementation of the const version is different, I can overload it).
Sure, C++'s const has tons of problems, but it seems like the Java and C# designers looked at it and said "const is probably more trouble than it's worth so let's not bother." Which is a fair assessment, I guess.
> What's an example of "more fine grained control"? I've used C++ more than Java or C#, but it's not apparent to me how removing a language feature gives you more fine grained control.
The point isn't the loss of "const", it's that the only access you do have to a class-type object in C++/Java/C# is via its interface.
If you hold all of an object's current state in private data members, then the only way you can change that state -- even if you can see the object and it isn't const -- is via the interface functions provided. If as class designer you just happen not to provide any interface functions that mutate the object (no set_whatever() calls, no assignment operators, etc.) then your object becomes immutable by implication.
The finer control the GP mentioned probably refers to the way you can also choose to permit certain specific types of mutation, by providing a limited set of functionality in the interface rather than direct access to the underlying values.
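Something along those lines, sketched in C# (the interface and member names are just illustrative):

    using System.Collections.Generic;

    // Write-only view: a client holding this can add mail but never read it.
    interface IMailDrop
    {
        void Deliver(string message);
    }

    // Two read-only ("const") views, each exposing a subset of the non-mutating members.
    interface IMailCounter
    {
        int Count { get; }
    }

    interface IMailReader
    {
        string Newest();
    }

    class Mailbox : IMailDrop, IMailCounter, IMailReader
    {
        private readonly List<string> messages = new List<string>();

        public void Deliver(string message) { messages.Add(message); }
        public int Count { get { return messages.Count; } }
        public string Newest() { return messages[messages.Count - 1]; }
    }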
Here we can give a client a write-only interface to the Mailbox, or one of two readonly (i.e. "const") interfaces that each have access to a subset of Mailbox's "const" methods.
const, of course, comes from C, where it works pretty well; unfortunately, C++ introduced features (e.g. templates) that didn't work well with const, resulting in stuff like vector::const_iterator (like vector::iterator, but for const vectors. It's completely different from vector::iterator, or so the compiler insists.)
Java is meant to be "C++ without the sharp bits", and removed this feature (and a host of other stuff); I imagine the fact that arrays are not "OO" enough also has something to do with it. C# is meant to be "Java done right/the MS way", and followed suit.
The concept of const actually comes from C++, and was back-ported to C.
I see how const_iterator can be a stumbling block, but I'd like to hear your reason for why it doesn't work well. It is different than iterator, since const_iterator accesses const values (by convention).
As soon as you start protecting yourself from the dangers of the unidentified dumbfool programmer, you get exactly this. At that point you can't do much of anything without first wrapping it in read-only collections or iterators or accessors, whereas what you really want is just to return a list of the damn things.
Java seems to be the breeding pit for such practices: it's a supposedly secure virtual machine that allows you to hide stuff so deep in your rear end that you can barely access it yourself any longer without a delicate assessment of numerous design patterns. That sort of capability allows you to get on that encapsulation-road far enough that it might not be feasible to ever back down and rewrite things in a sensible way.
But the truth is, all you need is the array holding the important values and all you need to do is document the function appropriately. "Don't modify the returned list" would be a good start.
Then you want your developers to have the skills to understand each part of that sentence. On the other hand, if someone doesn't, it's not your problem, since you documented the behaviour.
Some languages, such as C, let you state that more explicitly with the 'const' keyword and even have the compiler act as a gatekeeper. The developer can still copy the returned value into a non-const variable and mess around with that, but by then he's already broken the contract. In Java, it should be more than sufficient to cast it to a type denoting a read-only array or collection, but I don't see a common, real reason not to eventually return that single array.
> But the truth is, all you need is the array holding the important values and all you need to do is document the function appropriately. "Don't modify the returned list" would be a good start.
I agree that over-engineering is both common and unhelpful in software development. On the other hand, I think you are too quick to dismiss encapsulation techniques.
The sort of defensive programming tools you mentioned aren't there because dumb programmers don't know any better. They are there because smart programmers know that sometimes they will still make mistakes anyway.
See also: failing to free dynamically allocated memory, failing to check return codes or handle exceptions, failing to exit a loop under the correct conditions, creating deadlocks in multithreaded code, etc.
I strongly disagree with many of the claims in this post:
"An array’s whole purpose is to be a mass of mutable state. Mutable state is hard for both humans and compilers to reason about. It will be hard for us to write compilers in the future that generate performant multi-core programs if developers use a lot of arrays."
How does this conclusion follow from the premise? It also completely ignores that arrays are the natural expression of data in data parallel programs. Nearly all high performance parallel numerical code leverages arrays heavily (OpenMP, MATLAB, various GPU languages).
"If you are writing such an API, wrap the array in a ReadOnlyCollection<T> and return an IEnumerable<T> or an IList<T> or something, but not an array. (And of course, do not simply cast the array to IEnumerable<T> and think you’re done! That is still passing out variables; the caller can simply cast back to array! Only pass out an array if it is wrapped up by a read-only object.)
That’s the situation at present. What are the implications of array characteristics for the future of programming and programming languages?"
This appears to be a correct criticism of the mechanics of arrays in C#, but I don't see how that generalizes to the topic of programming languages in general. For example, we might use some sort of copy on write approach to preserve single copy performance when not changing the array is the common case.
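For instance, a hedged sketch of that copy-on-write idea (CowArray and its members are made-up names, not any real library type):

    using System;

    // A read-only view that shares its backing array with other views; a "write"
    // copies the array only at that point, so the common read-only case costs
    // nothing beyond the original allocation.
    class CowArray
    {
        private readonly int[] data;

        // (A production version would also defend against the caller mutating 'data' later.)
        public CowArray(int[] data) { this.data = data; }

        public int this[int index] { get { return data[index]; } }
        public int Length { get { return data.Length; } }

        // Copy-on-write: existing readers keep sharing the old buffer,
        // the caller gets a fresh view over a modified copy.
        public CowArray Set(int index, int value)
        {
            var copy = (int[])data.Clone();
            copy[index] = value;
            return new CowArray(copy);
        }
    }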
"Clock speeds have stopped increasing, transistor density has stopped increasing"
AFAIK transistor density is still tracking Moore's law and is expected to do so for quite some time (see SIA roadmaps).
"Parallelizing in a world with observable side effects means locks"
This ignores a whole class of wait-free data structures. Some of the most interesting parallel language tools are built around such structures. For example, Cilk is some extremely smart code centered around wait-free work-stealing deques.
"The “C” programming language is all about mechanisms. It lays bare almost exactly what the processor is actually doing, providing only the thinnest abstraction over the memory model."
This is a common misconception that leads to many bugs and performance problems. Modern compilers make dramatic transformations of programs, and the connection between what we assume based on the C abstract machine model and what the processor actually does is rather complex. My favorite example of this is how copying the values of pointer arguments into temporary variables at the top of a function body actually generates shorter, faster code, because it frees the compiler from assuming the pointers might alias each other. If you take the misguided "my C is close to what the processor does" perspective, you'd incorrectly assume those assignments result in a lot of wasted work.
I'm not even a C programmer and I know this. I'm quite tired of seeing this conceptual mistake from otherwise extremely knowledgeable people.
"Arrays demand imperative code, arrays are all about side effects, arrays make you write code which emphasizes how the code works, not what the code is doing or why it is doing it."
Entirely incorrect. For example, one can elegantly express many parallel algorithms as binary recursion with an associative combining operator over trees of arrays. This is a very declarative way of programming that is both elegant and maximally expresses parallelism. See the various presentations by Guy Steele for a very clear presentation of this perspective.
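For example, a parallel sum in that style might look like this sketch in C# (the sequential cutoff of 1,000 elements is arbitrary):

    using System.Threading.Tasks;

    class ParallelReduce
    {
        // Binary recursion with an associative combining operator (+):
        // split the array in half, reduce each half (one half on another task),
        // and combine. Associativity means any split order gives the same answer,
        // which is exactly what exposes the parallelism.
        public static long Sum(int[] data, int lo, int hi)
        {
            if (hi - lo <= 1000)
            {
                long acc = 0;
                for (int i = lo; i < hi; i++) acc += data[i];
                return acc;
            }
            int mid = lo + (hi - lo) / 2;
            Task<long> left = Task.Factory.StartNew(() => Sum(data, lo, mid));
            long right = Sum(data, mid, hi);
            return left.Result + right;
        }
    }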
As a summary, I'm impressed by the poster's knowledge of C# and how deeply he's considered this problem, but I think at the root of his thinking is an assumption that nearly all languages will be similar to C#. I disagree, and I think languages that center on nested data parallelism over arrays will play a major role in the future of programming.
The languages that use arrays efficiently for parallel programming (i.e. GPU languages) aren't general-purpose programming languages; programs in those languages are written with data parallelism in mind. C# is a general-purpose imperative language; it's very hard to take a program in a language such as C#, where mutation is interspersed throughout the code, and parallelise it automatically.
"I'm impressed by the posters knowledge of C#" - Eric Lippert designed and implemented large parts of the last couple of versions of the C# language and compiler.
"The languages that use arrays efficiently for parallel programming (i.e. GPU languages) aren't general-purpose programming languages"
I disagree. Typically these languages are supersets of general-purpose languages like C. That is, they are C code with additional annotations (OpenMP) or use annotations plus some initialization API (GPU languages). On top of that, there are purely array-based general-purpose languages (APL, J, K, NESL, etc.).
"it's very hard to take a program in a language such as C#, where mutation is interspersed throughout the code, and parallelise it automatically."
I agree, with qualifications. It's not so much the mutation that is at fault, but the over-specification of dependence. For example, map() makes clear that the iteration is data parallel; it is trivial to convert map() to pmap(). for() says nothing about the iteration being data parallel, so to convert it to par() the compiler must do an analysis. Depending on the program this may require an alias analysis. While that may be possible, and perhaps even easy for common cases, doing it in general requires quite complex analysis.
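A small C# illustration of that difference, using LINQ's Select as the map and PLINQ as the "pmap" (the squaring workload is made up):

    using System;
    using System.Linq;

    class MapVersusFor
    {
        static void Main()
        {
            int[] input = Enumerable.Range(0, 1000000).ToArray();

            // map: the projection declares each element independent of the others...
            int[] squares = input.Select(x => x * x).ToArray();

            // ...so switching to a parallel "pmap" is a purely local, mechanical change.
            int[] squaresParallel = input.AsParallel().AsOrdered()
                                         .Select(x => x * x).ToArray();

            // The equivalent for-loop writes into shared state and carries no such
            // guarantee; a compiler would need dependence/alias analysis to prove
            // the iterations independent before parallelising it.
            var output = new int[input.Length];
            for (int i = 0; i < input.Length; i++)
                output[i] = input[i] * input[i];

            Console.WriteLine(squares[10] + squaresParallel[10] + output[10]);
        }
    }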
In any case, I think it's perfectly possible to create a parallel mutable language. In fact, the agent model readily encapsulates mutability in a parallel and distributed way. So while I agree with the critique in the context of C#, I don't agree with how the author generalizes it to the topic of languages as a whole.
Why would sending an array back for something like the Twitter News Feed not be appropriate? Am I missing something? It's a list of tweets, so wouldn't you send back a list? That is an array?
The author considers "immutable arrays" to be different than "arrays". Due to the technical definition of an array in .NET maybe? In Obj-C you would just return an NSArray instead of an NSMutableArray and avoid the whole first part of this argument.
This is an example of bad terminology (or arguably a defective class library) clouding one's understanding. If you change the title to "MUTABLE arrays considered somewhat harmful", all of his complaints ring true. Merely changing the way arrays are defined makes the solution simple and obvious.
I am frequently impressed by the subtle aspects that Cocoa got right (another example is autorelease pools).
It's not the concept of arrays that is the problem - it's the implementation of them. As the author points out, it's easy to use arrays to accidentally create a memory-hogging application that doesn't perform well in multi-core environments.
It's better to access arrays through more sophisticated abstractions like IList in .NET, which can mitigate the issues with traditional primitive arrays and give developers something that works better in modern multi-core scenarios.
Technically System.Array also implements IList, which is an interface behind which sit several implementations. The generic List<T> type is arguably worse than a plain old array, as it is really a mutable, resizable array implementation; even the author states "one might reasonably point out that List<T> is a mass of mutable state too". Future-proofing yourself through the IList abstraction is good practice, but I'm not sure off-hand what you would put behind it that would be better, for now. I got the impression that the author was arguing for strictly immutable data structures, which you can achieve through diligent use of the readonly keyword and private setters, but which is not otherwise a native concept in C# or VB. I hope more of the F# immutable structures make it into the core framework in the future.
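For example, one way to build such an effectively immutable type today, with just readonly fields and no setters (Order is an illustrative name):

    using System.Collections.Generic;
    using System.Collections.ObjectModel;

    class Order
    {
        private readonly string customer;
        private readonly ReadOnlyCollection<string> items;

        public Order(string customer, IEnumerable<string> items)
        {
            this.customer = customer;
            // Copy defensively, then wrap, so neither the caller's collection nor
            // the view we expose can be used to change the order afterwards.
            this.items = new ReadOnlyCollection<string>(new List<string>(items));
        }

        public string Customer { get { return customer; } }
        public IList<string> Items { get { return items; } }
    }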
As a general practice it's bad to return List<T> for that reason - it's bloated with additional members and state to support just about anything out of the box.
Returning IList instead of List on any public interface you define is the rule I try to obey.
Basically it's all about how you expose arrays from your classes: do they expose some internal state, returning a deep clone of the array each time it's called (which is what the examples Eric used were doing), or do they expose their contents via a read-only List?
Lists are not arrays. An array is a sequence of elements laid out contiguously in memory. A list is an interface: it could be implemented as an array, or as a linked list, or a hash, or a B-tree, or whatever other implementation makes sense.