Hacker News
A distributed key value store in under 1000 lines (github.com/geohot)
187 points by Triv888 on Jan 5, 2021 | hide | past | favorite | 50 comments



I looked a bit into this code and here's a short summary:

- The API supports GET/PUT/DELETE (no range queries).

- It uses 1 master server and N volume servers.

- The master server keeps track of all keys and all requests need to go through this server. The values are stored in one or more volume servers using rendezvous hashing.

- The volume servers are just Nginx + WebDAV, storing data on disk.

- There's no live support for rebalancing. If you change the backend volumes, then you need to take the master server down, run a re-balancing job and then bring it up again.

- Objects are locked in the master server during modification. This means that while a server is processing PUT/DELETE all GET requests for that key will be blocked.

- There's no handling of errors from the volume servers. E.g. if one volume server fails to store data, then it won't rollback the changes done to the other volume servers nor do anything in the master server.

- The benchmarking tool (thrasher.go) does "10000 write/read/delete" across 16 goroutines. This means it generates 10k random keys (i.e. no contention at all), and for each key it does a PUT, then a GET, then a DELETE. Their results show it's capable of handling 3815.40 keys/sec (i.e. multiply this by 3 to get the request count).
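For reference, the rendezvous (highest-random-weight) hashing mentioned above picks the replica set by scoring every volume server against the key and taking the top scores. A rough Go sketch (illustrative only; the hash choice and names here are mine, not the project's actual code):

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
	"sort"
)

// pickVolumes returns the top-k volume servers for a key using
// rendezvous (highest-random-weight) hashing: every server is scored
// by hash(key, server) and the k highest scores win. Adding or
// removing one server only moves the keys that scored it highest.
func pickVolumes(key string, volumes []string, k int) []string {
	type scored struct {
		volume string
		score  uint64
	}
	scores := make([]scored, 0, len(volumes))
	for _, v := range volumes {
		sum := md5.Sum([]byte(key + "|" + v)) // any stable hash works
		scores = append(scores, scored{v, binary.BigEndian.Uint64(sum[:8])})
	}
	// highest score first
	sort.Slice(scores, func(i, j int) bool { return scores[i].score > scores[j].score })
	picked := make([]string, 0, k)
	for i := 0; i < k && i < len(scores); i++ {
		picked = append(picked, scores[i].volume)
	}
	return picked
}

func main() {
	vols := []string{"vol1:3001", "vol2:3002", "vol3:3003"}
	fmt.Println(pickVolumes("hello", vols, 2)) // same 2 volumes on every call
}
```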


Looking quickly at the code myself:

> - The API supports GET/PUT/DELETE (no range queries).

There are range queries on keys:

# list keys starting with "we"
curl -v -L localhost:3000/we?list
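That ?list lookup boils down to a prefix scan over the key index. A minimal standalone sketch, with an in-memory map standing in for the master's LevelDB index (names are hypothetical):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// listKeys returns all keys starting with prefix, sorted, which is
// the same shape of result as GET /<prefix>?list. An in-memory map
// stands in for the real on-disk index (sketch only).
func listKeys(index map[string]bool, prefix string) []string {
	var out []string
	for k := range index {
		if strings.HasPrefix(k, prefix) {
			out = append(out, k)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	idx := map[string]bool{"wendy": true, "web": true, "apple": true}
	fmt.Println(listKeys(idx, "we")) // [web wendy]
}
```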

> - There's no handling of errors from the volume servers

If a GET/PUT/DELETE fails, it is communicated to the master server, which doesn't write anything to its own database

All in all I believe it does quite a lot for less than 1000 lines already. Knowing its limitations I'm curious to see how well it handles production usage


> # list keys starting with "we"
> curl -v -L localhost:3000/we?list

Ah, I missed this!

> If a GET/PUT/DELETE fails, it is communicated to the master server, which doesn't write anything to its own database

Yes, but there's no protocol for cleaning it up. Example: If the second replica fails to store it, then the first replica will still have the data and there doesn't seem to be a way that it will be removed. This might end up "leaking" storage. Also, if you then attempt to do a rebuild it seems to me that the file will be revived.

> All in all I believe it does quite a lot for less than 1000 lines already.

Oh sure. I wasn't trying to be negative. I think it's great that people play around with these ideas! Distributed systems are complicated beasts, and there's only so much you can do in 1000 lines.

Another thing I realized: it buffers all received values in-memory in the master server. You might need quite a lot of memory if you intend to handle uploads of big files in parallel.


> There are range query on keys

I took it to mean the webserver doesn't support "HTTP range requests"[0], as opposed to ranges of keys.

[0] https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requ...


Maybe I'm just jaded by now, but these days whenever I see someone bragging about how much smaller their codebase is vs thing they want to replace, I start to wonder what edge cases they decided don't exist.

Like you try running this in production and 3 months later find out it instantly dies if there's more than 4% packet loss between you and the DNS server or something


It depends on the project; I haven't looked at this one in depth specifically, but it's certainly possible to write useful things with a (much) smaller codebase to replace (much) larger ones. Usually the trade-offs are:

1. It solves a more limited set of use cases. It's perfectly reasonable to want to solve a wide range of use cases, but there's value in more limited software which solves a more limited set of use cases, too.

2. It caters to a more limited set of users. Suckless is perhaps the epitome and most extreme flavour of this, where they just excluded things like config files in favour of a config.h file. This doesn't make the software less stable, but it does make it less usable for some users.

For example, I am working on a CI system right now, as I ran out of Travis "open source credits" pretty fast and didn't care much for the alternatives (or Travis, for that matter, but it did work) and I figured there's some space for a somewhat different kind of CI. It's currently functional but unfinished at about 1500 lines of code, and I'd be surprised if the finished version will be more than ~3000 lines.

It's smaller because it excludes features that are useful, but not for my (and I suspect many people's) use cases, and because it makes some assumptions that you roughly know what you're doing instead of abstracting everything. Combined, this drastically reduces the code size. This also means it's not useful for everyone, but I'm okay with that and it's a deliberate trade-off: not every project needs to solve all use cases or cater to all users.


The "small codebase" claim hinges on the fact that you're not counting the lines of code in nginx, leveldb, and other external software that are glued together to make this system. Those systems may reflect the years of battle-hardening you seek.

So, it may be more reliable than you think by virtue of that selective counting of codebase lines.


They are also using a pure Go version of LevelDB (github.com/syndtr/goleveldb) and not the more battle-hardened Google LevelDB or RocksDB implementations, or even dgraph-io's Badger WiscKey-style approach.


Unless someone is specifically bragging about it in a "how hasn't anyone else done this" way, I always see these as a coding exercise.

I watched a video of a "cabin anyone can afford and build" recently. It's 120 sq ft. He isn't saying "Why isn't everyone building their house for $2000?". It's a project to show that things people consider too complex can actually be achievable from the ground-ish up.


Also that the code is indecipherable because one of the ways they got the code size down is by playing code golf.

I wouldn't call it jaded, I'd call it experienced.


The downside of the opposite approach is wanting to use some tiny service for a dozen people and finding out it requires spinning up a data center's worth of virtual machines.


The repo readme states: `Used in production at comma.ai.` I have no more context than that, but it seems not to be totally a weekend coding exercise, though he later appears to acknowledge code quality concerns while mentioning recovery in case of a failure.


A few months ago I had an issue where I needed to transfer lots of files very quickly between two drives on a Windows server. All the available tools were slow or had some other issues. So I wrote a Go program which was about 50 lines or so. I know I didn't cover 99% of the edge cases that exist with file transfers, but for my use case it was perfect. Still running strong to this date.


It's George Hotz; literally all edge cases are left as "an exercise for the reader"


And he founded a self driving vehicle startup? Oh dear...


Well, at least he's not calling it Autopilot (Tesla) or suggesting it's more than it is while intentionally disabling safety features to get an unholy pile of software working (Uber).

I'm not an expert in those products, but from what I've seen people are usually aware that geohot's system is a work in progress, though so far he's pulled it off because it's (relatively) cheap, not worse than the competition, and constantly updated.


Well, if I remember correctly his original suggestion was to use the mobile phone to steer the car, nothing could go wrong with that...


> nothing could go wrong with that...

https://comma.ai/ says it’s been driven 30 million miles (over years), has anything gone wrong?


Also, unlike Elon tweets (from https://github.com/commaai/openpilot/blob/devel/README.md#su...):

> openpilot ALC and openpilot LDW do not automatically drive the vehicle or reduce the amount of attention that must be paid to operate your vehicle. The driver must always keep control of the steering wheel and be ready to correct the openpilot ALC action at all times.

While changing lanes, openpilot is not capable of looking next to you or checking your blind spot. Only nudge the wheel to initiate a lane change after you have confirmed it's safe to do so.

Many factors can impact the performance of openpilot ALC and openpilot LDW, causing them to be unable to function as intended. These include, but are not limited to:

- Poor visibility (heavy rain, snow, fog, etc.) or weather conditions that may interfere with sensor operation.
- The road facing camera is obstructed, covered or damaged by mud, ice, snow, etc.
- Obstruction caused by applying excessive paint or adhesive products (such as wraps, stickers, rubber coating, etc.) onto the vehicle.
- The device is mounted incorrectly.
- When in sharp curves, like on-off ramps, intersections etc.; openpilot is designed to be limited in the amount of steering torque it can produce.
- In the presence of restricted lanes or construction zones.
- When driving on highly banked roads or in presence of strong cross-wind.
- Extremely hot or cold temperatures.
- Bright light (due to oncoming headlights, direct sunlight, etc.).
- Driving on hills, narrow, or winding roads.

The list above does not represent an exhaustive list of situations that may interfere with proper operation of openpilot components. It is the driver's responsibility to be in control of the vehicle at all times.


I think your argument makes sense but the example you gave is far too extreme. Also, isn't splitting things into small services instead of large beasts full of hidden behavior the basis of the argument in favor of microservices?


> I think your argument makes sense but the example you gave is far too extreme.

It was a bit of an exaggeration yes, but it made my point about neglecting rarer failure cases very succinctly.

> Also, isn't splitting things into small services instead of large beasts full of hidden behavior the basis of the argument in favor of microservices?

With microservices your goal should always be simplicity rather than size. Simple code is often small, but small code is not always simple.

In fact, 3 well-built microservices will usually have more lines of code between them than a single monolith doing the same job, as they have to have all the boilerplate for talking to each other, accepting RPCs, etc.

(Also converting a large system that's hard to reason about into microservices generally results in a large distributed system that's hard to reason about - system design is hard!)


It needs to become commonplace to include your OS, external dependencies, and any language runtime code in this count. Otherwise it's just not going to evolve into a proper demoscene.


Reminds me of the first version of Redis in tcl: https://gist.github.com/antirez/6ca04dd191bdb82aad9fb241013e...


That is really cool to read! I love seeing the early seeds of big projects!


Man, that's so cool, thanks for sharing!


This seems more like a _sharded_ KV-store than a distributed one, no?

A truly distributed KV-store would use a consensus algorithm to prevent the master server being a single point of failure.

For instance, CockroachDB uses Raft [0].

[0] https://www.infoq.com/presentations/cockroachdb-distributed-...


If we go by most definitions of a distributed system, say Lamport’s, it still is. It’s networked, asynchronous, has group membership and replication. Provides liveness guarantees with a coordinator for requests and locks.

GFS/NFS predate many modern data stores, had SPOFs, and are considered important contributions in distributed computing.


To avoid some confusion, this is only the "distributed" part. The actual store is handled by the https://github.com/google/leveldb library.


For those interested, here is the youtube video of geohot building it: https://www.youtube.com/watch?v=cAFjZ1gXBxc


What is coded in that video is the python version of it. The actual version is written in Go.


Wow good catch indeed, thanks for pointing out


Shameless plug: I have a small YouTube series about developing a very similar distributed key-value database in Go. The sources for the database itself are at https://github.com/YuriyNasretdinov/distribkv . The database uses static sharding using powers of 2 and also requires downtime to increase the number of shards.
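The static power-of-two sharding mentioned above amounts to masking a hash. A tiny illustrative sketch (my own example, not distribkv's actual code):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a key to one of nShards shards, where nShards must be
// a power of two so the modulo reduces to a cheap bitmask. Changing
// nShards remaps most keys, which is why growing the shard count
// needs downtime (or a migration) in a static scheme like this.
func shardFor(key string, nShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() & (nShards - 1)
}

func main() {
	for _, k := range []string{"alice", "bob", "carol"} {
		fmt.Printf("%s -> shard %d of 4, shard %d of 8\n", k, shardFor(k, 4), shardFor(k, 8))
	}
}
```

One nice property of the bitmask: the shard for 4 shards is always the low two bits of the shard for 8, which makes a doubling migration easy to reason about.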


As usual, a tiny wrapper and huge beasts behind it, but let's count just the wrapper's code, lol. Nginx and LevelDB, who cares about them? :)


Do you count Python too while you're at it...


Dunno if anyone else has the same gut reaction, but Jepsen tests feel like table stakes at this point.


1000 lines... + nginx.

This looks really interesting, but I think I'd sell it as "using nginx as a key-value store" rather than a key-value store in less than 1000 lines.


Yet the number of LOC to be maintained is less than 1k. Following this logic we should then take standard libraries and the whole OS into account.


> Following this logic we should then take standard libraries and the whole OS into account.

I realise you're being snarky, but systems and verification schemas like that do exist. The terms I most often associate with them are "high-assurance" or "critical supply chain" software. There are bound to be many others.

These requirements crop up in stuff like the firmware for ATM keypads. Or Google's source code (they vendor everything in their own trees). Or Vegas slot machines.

In an amusing twist, software supply chain assurance is such a massive problem that large security consultancies offer code escrow services. If you, as a software seller, can't guarantee that you'll be around 20 years down the line, the buyer can require you to submit your code to such a third-party service. Should you go out of business, the source code will remain accessible to the buyer for their future development needs.


I honestly wish people did do this for these sorts of posts because it opens up a lot of solution space and would lead to interesting optimizations along the entire stack. The boundaries between all levels fade, the pieces ready for creative recombination.

All that other code needs maintainers. Is it really that interesting to say "look how much I can punt to others?"

It's not as if anyone except the authors will use these micro libs, we might as well encourage metrics that produce interesting ones.


I wasn't so much disagreeing with the number as with it being the tag line.

Excluding ansible/chef recipe to manage the machine, I could deploy Redis and have no lines of code to maintain.


Go does include a reasonably capable HTTP server, so they could have done it in Go... but would you add the Go's HTTP server LOC to the count? :)


Go's HTTP server is typically imported as a package, just like `math/rand`, `time` or `fmt`. The resultant compiled binary would then be larger than if it had not been imported, but the package itself doesn't contribute to LOC.

FWIW, the app does import and run Go's HTTP server for "server" commands: https://github.com/geohot/minikeyvalue/blob/master/src/main....


If Go's HTTP server would be a standalone binary and you have to take care of it separately from the main code base, then yes.


I am working on SeaweedFS, which this simple key value store project is inspired by. This minikeyvalue is a nice simplification for George's own use case.

SeaweedFS has filePath -> fileIds -> locateVolumeServers -> lookupOnVolumeServerByFileIds. These levels of indirection give the system more control over data management and placement. For example, one file can be mapped to multiple file ids.

minikeyvalue has key -> locateVolumeServer(by nginx) -> lookupOnVolumeServerByKey. This removes the file id indirection level, which means the data is placed on statically organized volume servers, and the whole value needs to fit on one volume server. So the value cannot be too large; evidently large values are not a requirement here.

There's no need to over-engineer anything. If this is what is needed, there's no reason to add more levels of indirection.

And no need to be critical about this project. It is efficient and it works. It is a piece of code that the author can easily fix if any problem happens. The code and the language are just tools to get the job done. His goal is autonomous driving, not a general-purpose key value store.


A distributed key/value store backed by nginx and leveldb. Though it is impressive that these two technologies are being combined to create the store in under 1k lines.

If you have the time to go through the codebase, it was an interesting read. Nothing too surprising, but cool to see how it all comes together.


He also made a short gradient descent library, tinygrad.

https://github.com/geohot/tinygrad


Does it support consistent replication and automatic failover? (you'd need to implement a Raft or Paxos-like replication algorithm for this)


Writing an s3 terraform resource is also under 1000 lines of code.


> go get github.com/syndtr/goleveldb/leveldb

Okay.


1000 lines + Nginx + LevelDB

This is basically a front-end with key balancing.



