Anna: A Fast, Scalable, Flexibly Consistent Key-Value Store (databeta.wordpress.com)
228 points by mpweiher on March 9, 2018 | 71 comments



The link to Bloom is broken: http://bloom-lang.net

This is the result of what appears to be a chain of ideas, papers and prototypes started more than eight years ago (see http://boom.cs.berkeley.edu - maybe even earlier; hard to tell, since most older URLs are gone). I'm amazed that people are able to remain funded while working on something with a very theoretical, long-term payoff, and at the same time incredibly thankful to the entities supporting this! I wish the team success in releasing Bedrock.


I wonder what the performance would be like if, instead of running on Ruby, they ran it on a scalable virtual machine with near-first-class actors, like BEAM (not truly first-class, but the engine is optimized to handle them, and the constructs are central to the standard library).


While Bloom was indeed coded in Ruby, they use C++ for Anna:

> We get this rich consistency in Anna with a very clean codebase, by porting design patterns of monotone lattice composition from Bloom to C++.

In the paper:

> The Anna actor and client proxy are implemented entirely in C++. The codebase—including the lattice library, all the consistency levels, the server code, and client proxy code— amounts to about 2000 lines of C++ on top of commonly-used libraries including ZeroMQ and Google Protocol Buffers.
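
For a flavor of what that lattice composition looks like (a toy Python sketch of the idea, not the paper's actual C++): a lattice pairs state with a merge function that is commutative, associative, and idempotent, and bigger lattices are composed from smaller ones:

  class MaxLattice:
      # Merge is max: commutative, associative, idempotent.
      def __init__(self, v=0):
          self.v = v
      def merge(self, other):
          self.v = max(self.v, other.v)

  class MapLattice:
      # A map whose values are themselves lattices; merge is pointwise.
      def __init__(self):
          self.m = {}
      def merge(self, other):
          for k, v in other.m.items():
              if k in self.m:
                  self.m[k].merge(v)
              else:
                  self.m[k] = v

  a, b = MapLattice(), MapLattice()
  a.m["x"] = MaxLattice(3)
  b.m["x"] = MaxLattice(5)
  a.merge(b)
  print(a.m["x"].v)  # 5, regardless of the order replicas merge in

Because merges can be applied in any order, any number of times, replicas converge without locks or coordination.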


Curious to see such effective code, I looked for the source, but found nothing.

The closest codebase I found is https://github.com/ucbrise/LatticeFlow.

Is the Anna source repository public?


It may be hard to get funded to work on one thing for 10 years, but it is even harder to stay passionate and keep working toward the same goal for 10 years. I mean, unless it is ridiculously well paid, most people would have lost interest.


Being faster (in throughput) than Redis doesn't seem that difficult; Redis is mostly single-core. But that restriction comes with a magical property: you can run Lua on that single core without having to think about race conditions or concurrency. For many complicated systems this is an incredible property and can provide real power. Yes, you may eventually outgrow Redis, but if you do, you are entering pretty crazy territory where you will likely need something custom anyway.

All that to say, that particular comparison feels a bit apples-and-oranges to me.


The comparison with Redis is on page 10, and (obviously) takes into account the number of threads. It is a bit presumptuous to post a comment like this without making a minimal effort; please don't.

Regarding embedded processing, it's very easy to embed whatever single-threaded language you'd like into C: Lua, JavaScript, or something new like Gravity. This is orthogonal to the storage/network architecture.

A better argument would be on the extra operations and data structures that Redis offers, not being a simple key-value store.


The comparison to Redis is in the 4th paragraph of the linked article. "The paper includes numbers showing it beating Redis by over 10x on a single AWS instance"

I'm sorry I didn't go read the original paper, but I thought reading the article qualified me to comment. Sorry dad.

I think you also misunderstood my point about Lua. Embedded Lua in Redis is so powerful BECAUSE it is single-threaded, not because it is just snazzy to have an embedded language. That, combined with the primitives Redis provides, allows you to build your own domain-specific data structures with custom semantics that just aren't possible in other systems without rolling your own. And you can do it simply.
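
To make that concrete (a minimal sketch, assuming a local Redis and the redis-py client; the key name and limits are made up): a fixed-window rate limiter whose read-check-write cycle is atomic purely because the script runs alone on Redis's single thread:

  import redis

  r = redis.Redis()

  # The whole script executes atomically: nothing can interleave
  # between the INCR and the EXPIRE, so no locks are needed.
  RATE_LIMIT = """
  local n = redis.call('INCR', KEYS[1])
  if n == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
  end
  if n > tonumber(ARGV[1]) then
    return 0
  end
  return 1
  """

  allowed = r.eval(RATE_LIMIT, 1, "rate:user42", 100, 60)  # limit 100 per 60s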

That Anna is faster is great, but it comes with its own set of constraints; I don't think I would be wrong in saying those include not having exclusive access to the data while you are in an embedded script running on the store.


> "The paper includes numbers showing it beating Redis by over 10x on a single AWS instance"

Right, single core against single core, the best scenario for Redis, and Anna is an order of magnitude faster. I'm not sure what you don't find impressive about this.


Ahem. Single AWS instance does not equal single core.

See my earlier note; it is a bit apples to oranges. Redis is not optimized for throughput on a machine, it is optimized for throughput on a single core. And that property allows lots of interesting things, due to not having any parallelism.

They are different tools and they have different performance characteristics. My only point was that comparing against Redis is somewhat misguided due to those differing goals.


The article is comparing the [new] underlying architecture of a key-value store, which Redis also is (and one of the fastest). The other features are irrelevant for the comparison. “Anna” is a proof of concept, not a competing product.


I feel like between Redis, S3, Cloud Storage, RocksDB, Cassandra, etc...this area strikes me as one that has been solved as well as we could reasonably expect it to be.

What the world of data needs more of is continued development into novel indexing strategies/implementations. ElasticSearch, Postgres's GIN index on JSONB, MapReduce, graph databases.

I don't need another key value store...


I don't mean this facetiously: did you read the article? This work presents novel ways of "solving" the problem that cause it to beat the other contenders fairly convincingly (in non-linearizable regimes).


I agree. However, I stumbled into the world of KV stores (like RocksDB, LMDB, LevelDB, etc.) last year, and what is most surprising is that they all stop in the same place. I understand that they should do one thing and do it well, but it is still disappointing when you have to implement things like replication, sharding, and indexing yourself.

There really aren't that many DBMSs that are KV (like Redis) out there to handle it either. They are normally much more complicated (like adding a SQL layer on top).


BerkeleyDB has a replication engine. IMO that's too much of a kitchen-sink approach. Judging by how many of those KV stores utterly fail to store data reliably, that's already a hard enough problem to solve. Focusing on the local storage is a clearly delineated realm of responsibility. Distribution obviously belongs to a higher logical layer.

Indexing requires knowledge of a higher-level data model. (Again, BerkeleyDB has built-in support for secondary indexing, but last time I checked it was a quite braindead and slow implementation. Faster to build your own indices instead, using the other facilities provided.)

With that said, while a KV store has no logical data model to apply to index generation, it can at least provide primitives for you to construct your own indices. BerkeleyDB and LMDB do this.
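
A minimal sketch of that idiom (a plain Python dict and sorted list standing in for the KV store; the key naming scheme is invented): the secondary index is just extra keys whose sort order lets a prefix range scan find all matches:

  import bisect

  store = {}          # primary: "user:{id}" -> record
  index_keys = []     # sorted composite keys: "idx:email:{email}:{id}"

  def put_user(uid, email):
      store[f"user:{uid}"] = {"id": uid, "email": email}
      # The index entry encodes field value then primary key, so a
      # prefix range scan over the sorted keys finds all matches.
      bisect.insort(index_keys, f"idx:email:{email}:{uid}")

  def users_by_email(email):
      prefix = f"idx:email:{email}:"
      i = bisect.bisect_left(index_keys, prefix)
      while i < len(index_keys) and index_keys[i].startswith(prefix):
          uid = index_keys[i].rsplit(":", 1)[1]
          yield store[f"user:{uid}"]
          i += 1

  put_user(1, "a@example.com")
  put_user(2, "a@example.com")
  print([u["id"] for u in users_by_email("a@example.com")])  # [1, 2]

In LMDB or BerkeleyDB you'd do the same thing with a cursor positioned by a range lookup; the store doesn't need to understand your data model, only your key encoding.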

Distribution with transaction support may require help from the storage engine (offering something resembling multi-phase commit). BerkeleyDB provides this already; LMDB will probably provide this in 1.0.

An argument could be made that the storage engine should be able to handle replication/distribution even without understanding the higher level/application data model. BerkeleyDB does this with page-level replication. IME this results in gratuitously verbose replication traffic, as every high level operation plus all of its dependent index updates etc. are replicated as low level disk/page offset operations. IMO it makes more sense to leave this to a higher layer because you can just replicate logical operations, and save a huge amount of network overhead.

As for the possible higher layers - antoncohen's response below gives a few examples. There are plenty of higher level DBMSs implemented on top of LMDB, providing replication, sharding, etc.


I think we're just getting to the point where it may become more common to separate out the simpler problem of a single-node, non-distributed 'store' with some KV interface below, and then build the more complex distributed algorithms in a layer or two above. You can see some of the larger monolithic codebases, which had to start by owning their own code all the way up and down the stack, now starting to experiment with backend store interfaces, so you can trade off the strengths and weaknesses of various local stores' performance.

Along the same lines, a few newer codebases for distributed stores seem to be building with those delineations in mind. Another comment brought up TiDB/TiKV, for example. TiKV, IIRC, uses RocksDB as its local store.


"Just getting to the point" ? OpenLDAP has been architected this way for ~20 years. I think the same could be said for MySQL, as well as SQLServer (built on top of ESENT/JET). Large monolithic data stores are an obvious anti-pattern, reflects short-sighted design process.


Ha, you have a good point. I think my comment was directed more at "distributed cloud store" applications, much more so than databases. That raises the question: what capabilities have those apps been focusing on that they didn't get from existing databases (or thought that they didn't get from existing databases)...


RocksDB, LMDB, and LevelDB are basically low level disk representations used by the databases that do things like provide network access, sharding, and replication. OpenLDAP (LMDB), MySQL (MyRocks), Bigtable (LevelDB-like), Riak (LevelDB), etc.

Many are or can be used as key-value stores. MySQL actually has a memcached compatible KV store, using InnoDB for storage. Postgres has HStore. A lot of the distributed databases roughly fall into the category of KV stores: HBase, Riak, Cassandra, DynamoDB, etc.


> I feel like between Redis, S3, Cloud Storage, RocksDB, Cassandra, etc...this area strikes me as one that has been solved as well as we could reasonably expect it to be

This paper literally shows this statement is false. In what other circumstance could you deliver a one-to-two order of magnitude performance improvement and have people just shrug and say, "meh, I really don't think we need improved performance"?


All the options you mention have varying features that give them advantages and disadvantages in various scenarios. What works well in one won't in others, etc.

That said, I was mostly interested in how this compares to etcd, since the use case seems pretty similar. How does Anna scale out (across data centers), for example, compared to etcd, where that's a bad idea?


I've just finished reading the paper. One difference is that Anna offers a weaker form of consistency than etcd: etcd offers sequential consistency, and Anna offers causal consistency. I believe this could eliminate many etcd use cases, since the ordering of non-causal events may be different depending on the node you access.

I'm not sure I fully understand etcd's durability guarantees. When an operation is completed, does that mean that the data is durable on a single machine, and then becomes durable on other machines at a later point? If so, it seems like Anna could offer the same durability guarantees.

I think Anna's architecture could be more of a competitor to Cassandra or DynamoDB, as long as you only need causal consistency. The performance implications do seem pretty interesting.


etcd does not provide key partitioning/sharding; it achieves linearizable consistency using a replicated state machine via Raft. Anna provides weaker consistency guarantees (causal or read committed), but has partitioning/sharding.
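
For intuition on the difference (a toy sketch, not how either system is implemented): causal consistency is commonly tracked with vector clocks, which order two events only when one causally precedes the other:

  def merge(a, b):
      # Component-wise max: the lattice merge for vector clocks.
      return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

  def happens_before(a, b):
      # a -> b iff a <= b component-wise and a != b; otherwise concurrent.
      return a != b and all(a.get(k, 0) <= b.get(k, 0) for k in a)

  x = {"n1": 2, "n2": 0}
  y = {"n1": 1, "n2": 3}
  print(happens_before(x, y), happens_before(y, x))  # False False: concurrent
  print(merge(x, y))  # n1=2, n2=3 (key order may vary)

Concurrent versions have no forced order, which is exactly why two replicas can legitimately disagree on the ordering of non-causal events.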


Cassandra means JVM GC issues.

Redis is in-memory only.

Cloud Storage - not sure how we can use it outside of the cloud vendors.

RocksDB - Facebook open-sourced just the engine to the community; where is the service which adds replication, clustering and network interface on top of it? I am sure they use one internally; why is it not being open-sourced?

There is also Badger, but most of these only offer low-level operations.

Sorry, but most of my developers are unable to consume these the way they can Redis.


> where is the service which adds replication, clustering and network interface on top of it?

Actually, I would prefer that kind of database (just the engine).

1. Expose it to the network via whatever framework you like: Thrift, REST, gRPC, ... You don't have to include a different client library for each network service. I would love to connect to every network service (Redis, Elasticsearch, Cassandra, MySQL, ...) via a single framework (say gRPC).

2. In most large-scale scenarios there is already some kind of log service (DistributedLog, NATS, Kafka, ...). Why not take advantage of that for replication? Isn't it great to separate the engine layer from the replication layer? Currently we are actually doing double replication: replicate data from the master DB to the slave DB, then replicate the same data from either DB to the cache, search, ... components. The data is already there in the log; let everyone (the slave DB as well as the cache/search modules) consume it. This is basically the state-machine-replication idiom; see the sketch after the references below. PNUTS [0], Twitter's K/V database, and LinkedIn's Espresso [1] (as well as Ambry [2], their internal object store) use this approach for replication.

3. I would agree with that; they only support basic low-level operations.

[0] http://www.vldb.org/pvldb/1/1454167.pdf

[1] http://www.csce.uark.edu/~xintaowu/BDAM/p1135-qiao.pdf

[2] http://dprg.cs.uiuc.edu/docs/SIGMOD2016-a/ambry.pdf
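
Here is a toy sketch of that state-machine-replication idiom from point 2 (pure Python; in reality the log would be Kafka/DistributedLog and the consumers would be the replica DB, the cache, and the search indexer):

  log = []  # the shared, ordered replication log (stand-in for Kafka)

  def append(op, key, value=None):
      log.append((op, key, value))

  class Consumer:
      """Any component (replica, cache, search) replays the same log."""
      def __init__(self):
          self.state, self.offset = {}, 0

      def catch_up(self):
          # Applying the same ops in the same order yields the same state:
          # that is the whole state-machine-replication argument.
          while self.offset < len(log):
              op, key, value = log[self.offset]
              if op == "put":
                  self.state[key] = value
              elif op == "del":
                  self.state.pop(key, None)
              self.offset += 1

  append("put", "k1", "v1")
  append("put", "k2", "v2")
  append("del", "k1")

  replica, cache = Consumer(), Consumer()
  replica.catch_up(); cache.catch_up()
  assert replica.state == cache.state == {"k2": "v2"}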


Agree w/ all these points (forgot "S3 has provider lock-in", though others replicate the API). I do use Cassandra for most use cases and I don't hit GC issues, but I understand the concerns that come with the JVM (no, haven't subbed Scylla in yet). One that I haven't tried but want to hear the downsides of is https://github.com/pingcap/tikv (not the DB built on top, just the KV layer). A nice published list of all the cons of all database systems would be ideal.


From what I could tell, TiKV is virtually useless without the rest of TiDB (the MySQL layer, written in Go, that sits on top).

TiKV has no replication/sharding built in; that is actually handled by the Placement Driver (PD).

From the docs: > TiKV is a component in the TiDB project, you must build and run it with TiDB and PD together.


You can disk-back Redis at various durability levels if you so choose:

https://redis.io/topics/persistence


Redis supports at least two forms of persistence: (1) periodic checkpoints (RDB); (2) writing an append-only file (AOF) which can be periodically compacted to save space.
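
Concretely, those map to the RDB and AOF settings in redis.conf (the thresholds below are just illustrative defaults):

  # RDB: checkpoint if at least 1 key changed within 900s, etc.
  save 900 1
  save 300 10

  # AOF: append every write, fsync at most once per second
  appendonly yes
  appendfsync everysec

  # Rewrite (compact) the AOF once it doubles past 64mb
  auto-aof-rewrite-percentage 100
  auto-aof-rewrite-min-size 64mb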


> RocksDB - Facebook just outsourced the engine to the community, where is the service which adds replication, clustering and network interface on top of it?

https://myrocks.io/


They also released MyRocks, a RocksDB engine for MySQL, which is being integrated into MariaDB 10.3 and Percona Server 5.7. I don't think it supports replication yet, but it's a start.


Cassandra has a host of other trade-offs besides Java (tombstones and data garbage collection come to mind).


There is a RocksDB storage engine for Cassandra which claims to reduce GC stalls:

https://engineering.instagram.com/open-sourcing-a-10x-reduct...


There's a new one of these posted like every day, and every single one is "crazy fast" and "super scalable".


Almost as frequent as the new hotness JS framework.


This link claims to be 700x faster than something called Masstree, which itself claims to do about 3 million requests per second with 16 cores. I'm not sure I buy 131 million requests per second per core.

> it was up to 700x faster than Masstree, up to 800x Intel’s “lock-free” TBB hash table. In fairness, those systems provide linearizable consistency and Anna does not. But Anna was still up to 126x faster than a “hogwild”-style completely inconsistent C++ hashtable due to cache locality for private state, while providing quite attractive coordination-free consistency.


So reading through this announcement, it seems the key to Anna's speed is:

local cache (in the form of the actor's mailbox) + background gossip

The usual restrictions (and increases in latency) still apply when you want to make sure that something's actually written (quoruming) after you've written it, from what I can tell.
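
(For reference, the quorum rule in question: with N replicas, a read of R replicas is guaranteed to overlap a write of W replicas exactly when R + W > N. A toy check, assuming the standard Dynamo-style formulation:)

  def read_sees_write(n, r, w):
      # Any R-replica read set and W-replica write set out of N replicas
      # must share at least one replica iff R + W > N.
      return r + w > n

  print(read_sees_write(5, 3, 3))  # True: reads intersect the last write
  print(read_sees_write(5, 1, 3))  # False: a read may miss it entirely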

Can someone explain to me why this is a step forward for the field -- I haven't yet read all the papers they linked to (including the Anna papers), but this doesn't seem to be one of those times where a bunch of disparate papers combine into something truly groundbreaking?

I feel like I must be missing the point


I don't think that's correct. The real key to its performance seems to be the usage of distributed lattices as data structures, and the ability to perform compile-time checks that guarantee the data will be eventually consistent, both of which allow code to be completely lock-free. This comes from the CALM paper cited in the article - lots of reading to do!


I thought so before too, but after reading the paper again, I came to the same conclusion as the parent. The usage of distributed lattices is key, but it only works because it allows them to reduce messaging cost and gossip at background intervals. As far as I can tell, this means that you can receive a successful write response, have that machine die, and all data within the last multicast period is lost. Therefore it isn't suitable as a datastore, and the benchmarks are mostly worthless, with the exception of Redis.


Super duper late, but I still haven't had time to read any of the papers (there are like 4 if I really want to get anywhere close to understanding their spin on gossip + the lattice thing) -- glad the discussion is still interesting though.

I'm starting to think that the quorum strategy is something like a theoretical lower bound -- at least until someone brilliant figures out a way past it (or technology shifts in some gigantic way or something).



Has there been a public release of the source for this?


Looks interesting, especially the coordination-free consistency and lock-free structures!


The claims in the blog post (orders of magnitude faster than current state-of-the-art systems, universal linear scalability from threads to many nodes, dismissing Dean's rule of redesigning after every 10x growth in scale) seem overblown to me.

What have they really built? A purely in-memory KV store that doesn't support synchronous secondary writes for durability. So any comparison with ACID KV stores, either disk-based (Cassandra, Mongo) or in-memory, is not apples-to-apples from the beginning. What could be the production applications of such a system, other than a cache?

On their benchmarks: they don't really compare with the state of the art.

The selection of competitors in the single-server, multi-core benchmark doesn't include systems like https://github.com/fastio/pedis. Also, they still use 100-millisecond gossip granularity (within a single server!), while for all the other compared systems the corresponding metric is effectively zero by construction; that gives Anna a huge edge.

In the multi-node benchmark, they claim 10x over Cassandra. ScyllaDB (https://www.scylladb.com/) claims the same, while being ACID and linearizable, unlike Anna. Also, Anna achieves its stronger consistency levels by holding off reads, which kills latency given the 100-millisecond gossip granularity. If that applies only to the multi-key consistency levels (Read Committed/Uncommitted), it's probably OK, because I suppose there is no magic bullet that preserves super-low latency while providing similar consistency in Scylla either. But if Anna needs to hold off reads for any of its claimed single-key consistency levels (all of which are weaker than linearizable), that's worse than Scylla. The authors of the paper didn't detail the algorithm for each consistency level.

It seems the authors don't benchmark the multi-node scalability of Anna at any consistency level except the weakest, simple eventual consistency. It would be interesting to see whether Anna scales as well at the stronger consistency levels.

To me, the main outcome of this paper is another confirmation that shared-nothing, thread-per-core, message-passing designs are beneficial in the modern computing environment. This is not new, however; see H-Store, Scylla/Seastar, Tarantool (https://github.com/tarantool/tarantool), Aeron (https://github.com/real-logic/aeron), Tempesta (https://github.com/tempesta-tech/tempesta), etc.

The novelty is a framework that generalizes across thread/node scalability and different consistency levels while reusing the same codebase, with just a single knob: gossip granularity. Practical applications are limited. Certain techniques will probably be cherry-picked by systems such as Redis Cluster and in-memory data grids.


Can you point me to where Scylla claims to be ACID or linearizable? As far as I know there is no Paxos implementation yet, not that Cassandra's LWT implementation is anything to write home about.


The Anna paper itself says that Cassandra and Scylla are linearizable per-key. Yes, obviously Scylla is not ACID; sorry for my loose usage of the term. I was referring to durability, i.e. the "D" in ACID.


Looks exciting but very light on details.


There's a lot more detail in the linked paper.

http://db.cs.berkeley.edu/jmh/papers/anna_ieee18.pdf


Can Anna's approach improve current solutions to the problem of managing secondary indexes in a partitioned KV store while preserving consistency?


I don't think this problem exists outside of the realms of linearizable or serializable consistency, which this system doesn't provide.


> Totally ordered request processing requires waiting for global consensus at each step, and thus fundamentally limits the throughput of each replica-set.

Really? Such global consensus increases latency (a network round trip plus an fsync write), but with a fully batched and pipelined concurrent design, when CPU cycles are being saturated as in those benchmarks, why is throughput fundamentally limited by that increased latency?


Let's say we have a primitive system with total ordering and a single static coordinator per replica-set. And we want to update a global counter from every node. Since the counter is global, it lives on a single replica-set, and every node has to communicate with that one coordinator. Be it 4 nodes or 4000, still one coordinator. If we drop total ordering and use something like a gossip protocol to communicate between nodes, we can merge this counter everywhere in the system as it propagates, eliminating the unnecessary communication and distributing the load across the network. So, yeah, global consensus fundamentally limits both throughput and latency.
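
That merge-as-it-propagates counter is essentially a grow-only counter CRDT (a toy sketch; Anna's lattices generalize the same trick):

  class GCounter:
      def __init__(self, node_id):
          self.node_id, self.counts = node_id, {}

      def increment(self):
          # Each node only bumps its own slot, so no coordination is needed.
          self.counts[self.node_id] = self.counts.get(self.node_id, 0) + 1

      def merge(self, other):
          # Component-wise max: commutative, associative, idempotent, so
          # gossip can deliver states in any order, any number of times.
          for k, v in other.counts.items():
              self.counts[k] = max(self.counts.get(k, 0), v)

      def value(self):
          return sum(self.counts.values())

  a, b = GCounter("a"), GCounter("b")
  a.increment(); a.increment(); b.increment()
  a.merge(b); b.merge(a)
  assert a.value() == b.value() == 3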


That's a really good question... I think the answer is something like this (just spitballing here, not trying to make a strong claim):

At some point in a system, you may need a response to one request in order to generate the next request. At this point latency affects throughput.
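
To put rough numbers on that (a back-of-the-envelope sketch using Little's law, throughput = requests in flight / latency; the 1 ms figure is just an assumption):

  latency_s = 0.001  # assumed 1 ms consensus round trip
  for in_flight in (1, 16, 1024):
      print(in_flight, "in flight ->", int(in_flight / latency_s), "ops/s")
  # With dependent requests you are stuck at 1 in flight -> 1000 ops/s;
  # batching/pipelining only recovers throughput when requests are independent.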

Also, apparently, their "lattice composition" technique lets them push concurrency to a lower level, which allows them to avoid the overhead of doing concurrency at a higher level. What I mean is: if you have to process items one at a time in a replica, then to process more items at a time you need more replicas. But each replica has overhead -- the details depend on the technology in question -- e.g., to utilize 4 cores on your server you might host four replicas on that server, but now you have to maintain four memory spaces as well, which takes cycles, even though all you wanted was to use all your cores to process items. And it might not be four cores, but 18 or 36 or perhaps thousands. At some point, the cost of the overhead is greater than the benefit.


If anyone is interested in looking at the benchmarks or implementation: https://github.com/cw75/tiered-storage is from the main developer and it looks similar to Anna.


Rant:

Why so much focus on Key Value stores? That's the easy part of the problem.

I would like to know more about the interesting ones: secondary indexes, range scans, performance on mixed workloads, robustness, operational complexity.


The key-value store is not the focus of this research, and the problems it does focus on are not even in the same league as the ones you are interested in.


> Why so much focus on Key Value stores? That's the easy part of the problem.

There aren't enough good/fast/reliable ones


Is that a CAP joke or what?


Does Redis not count?


Redis is an impressive piece of engineering, but it performs best as an in-memory KV store on a single core. Its distributed capabilities target a different problem than other distributed KV stores attempt to solve. Redis Cluster focuses on reasonable functionality for an in-memory store. However, Redis Cluster is neither highly available nor consistent. There are multiple failure modes that can cause catastrophic data loss, so Redis Cluster works best in situations where losing data isn't a big deal. For its intended use case, nothing else comes close to offering the same functionality, performance, reliability, and ease of use.


R-trees (and their higher-dimensional counterparts) and B/B+-trees are great for supporting range queries. You're right that KV systems are limited, and I find the above quite effective at supporting more general operations.


CockroachDB is implemented on top of a key/value store. I don't know too much about the internals, but a KV store with lots of degrees of consistency seems like a good primitive to build more complex data stores on top of. For instance, indexes could live as immediately consistent sharded keys, while data stays eventually consistent (or user-configurable).


Here are my 2 cents on why this is the case:

On general-purpose hardware and networks, adding features like the ones you mention tends to be seriously difficult and tremendously workload-dependent. At some point you need help from hardware designers, which itself limits the use cases of your system.

FPGA-assisted databases, in-network SQL processing, serverless caches (an SSD connected directly to the network without an OS), storage systems with mind-blowingly low energy consumption, ... are just some of the absolutely amazing research being done today. However, the limited public availability is a bummer.

With mass public usability in mind, a general-purpose K/V system is the highest summit you can really achieve.


There is already a database product called Bedrock, using SQLite.

http://bedrockdb.com/


Can anyone with a good understanding of latest db research share their thoughts on the paper?


That's not a very lucky name IMHO, given that there already exists data-storage-related software with a nearly identical name: SAP HANA (https://www.sap.com/products/hana.html).


No link?



Why this fad with naming apps as persons? Found this: http://fortune.com/2014/12/22/startup-names-human/


Anna is a hummingbird; Anna's hummingbird is known for reaching the fastest speed relative to its body length of any vertebrate.


Why this fad with naming birds as persons?



