Write a time-series database engine from scratch (nakabonne.dev)
331 points by todsacerdoti on July 4, 2021 | 24 comments



This is a good write-up. Many years ago I decided to write a log-based DB, to see how far I could go. It was supposed to be lightweight (Berkeley DB was the inspiration). It had a working stream-to-file for all changes, and a compaction thread in the background.

What I learned: creating the main/basic features of a non-relational (log-based) database is easy... but it's the edge cases that hit you hard, and dealing with them creates about 90% of the code.

It was a fun exercise from which I learned a lot.


Hey, if you don't mind, what are some examples of edge cases?


Saturation of CPU resources

Making sure the in-memory cache doesn't get too big

Recovery when something goes wrong (computer loses power)

Recovery of data when there is a partial HDD failure

Compaction creating performance problems

Handling of duplicate requests / out-of-order requests

etc.. etc...

Basically, performance, readability, and data consistency are hard, especially for a database that has a lot of concurrent users...

My toy one was super fast (I don't remember the stats off the top of my head) and was good enough for fair-weather conditions, but I realized that to make it actually production-ready there was so much more to do, and it wasn't worth it for me to spend more time on it, since there are so many good options out there, and SQLite was becoming a de facto standard on mobile (both iOS and Android had adopted it by then).


This is a good tutorial.

The problem of out-of-order data becomes more challenging as ingest throughput requirements increase if your storage representation must guarantee a strict total order. In high-throughput designs this is often handled by relaxing the strict total ordering requirement for how the data is represented in storage. As long as the time-series has an approximate total order at ingest time, there are many techniques for inexpensively reconstructing a strict total order at query time.


Right, exactly. As a point of reference, within M3DB each unique time series has a list of "in-order" compressed timestamp/float64 tuple streams. When a datapoint is written, the series finds an encoder it can append to while keeping the stream in order (timestamp ascending); if no such stream exists, a new stream is created and becomes writeable for any datapoints that arrive with timestamps greater than the last written point.

At query time these streams are read by evaluating the next timestamp of all written streams for a block of time and then taking the datapoint with the lowest timestamp across the streams.

M3DB also runs a background tick that targets completing within a few minutes each run, to amortize CPU. During this process each series merges any sibling streams created due to out-of-order writes, producing one single in-order stream. This uses the same process as at query time to read the datapoints in order; they are then written to a new single compressed stream. This way the extra computation due to out-of-order writes is amortized, and only if a large percentage of series are written in time-descending order do you end up with significant overhead at write and read time. It also reduces the cost of persisting the current mutable data to a volume on disk (whether for a snapshot or for persisting data for a completed time window).
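
To make the mechanics concrete, here's a minimal Go sketch of that idea (not M3DB's actual code): each series keeps one or more timestamp-ordered streams, an out-of-order write opens a new stream, and reads do a k-way merge by lowest next timestamp. The series/Append/ReadOrdered names are illustrative, and a real engine would hold compressed encoders rather than plain slices.

```go
package main

import "fmt"

// point is a timestamp/value tuple; a real engine would keep these
// inside compressed encoders rather than plain slices.
type point struct {
	ts  int64
	val float64
}

// series keeps one or more streams, each individually ordered by timestamp.
type series struct {
	streams [][]point
}

// Append finds a stream whose last timestamp is <= ts and appends to it;
// if none exists (an out-of-order write), it opens a new stream.
func (s *series) Append(ts int64, val float64) {
	for i, st := range s.streams {
		if len(st) == 0 || st[len(st)-1].ts <= ts {
			s.streams[i] = append(st, point{ts, val})
			return
		}
	}
	s.streams = append(s.streams, []point{{ts, val}})
}

// ReadOrdered does a k-way merge across the streams, repeatedly taking the
// datapoint with the lowest next timestamp. The same routine could be used
// by a background job to compact the streams into a single one.
func (s *series) ReadOrdered() []point {
	idx := make([]int, len(s.streams))
	var out []point
	for {
		best := -1
		for i, st := range s.streams {
			if idx[i] < len(st) && (best == -1 || st[idx[i]].ts < s.streams[best][idx[best]].ts) {
				best = i
			}
		}
		if best == -1 {
			return out
		}
		out = append(out, s.streams[best][idx[best]])
		idx[best]++
	}
}

func main() {
	var s series
	for _, p := range []point{{10, 1}, {20, 2}, {15, 1.5}, {30, 3}} {
		s.Append(p.ts, p.val)
	}
	fmt.Println(s.ReadOrdered()) // [{10 1} {15 1.5} {20 2} {30 3}]
}
```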


For Interana I think we end up doing this by batching writes & sorts, and not really having a strict guarantee on when imported data actually shows up in query results.


Whatever you’re doing, it works great :) We used Interana at a previous company I worked at, and the combo of query flexibility and performance was excellent, really liked the product!


TY! Twitter has since acquired the Interana engineering team & IP, so we’re now doing the same thing at Twitter.


Ah interesting, like as an internal tool there?


Yes.


Any cool "create a database from scratch" books? Sounds like a fun read. Like The Graphics Programming Black Book, but for databases.

Of course there's plenty of literature on individual components, but a holistic view would be fun to read.

I suppose you could read sqlite or postgres code... but that doesn't fit well on a Kindle. :)


Yes indeed, friend! This one is a classic: https://cstack.github.io/db_tutorial/

It's an SQLite clone from scratch.


Have you gone through it? I've seen it around for years, however it unfortunately looks unfinished and unmaintained. I think I'd be unfulfilled going through all of it just to not have a finished tutorial.


It's not a book but I wrote a series on writing a postgres clone from scratch in Go.

https://notes.eatonphil.com/tags/database.html


You could also start from first principles and essential abstractions. If you build a solid key-value store, it is possible to construct higher-order stores on top.

Once you can quickly find a thing with some key, any degree of structure and complexity is possible from that point. Column-based, row-based, document-based, etc. All are possible to build on top of K-V. Might not be the most optimal solution, but it is a very accessible path in my experience. You can start defining types like Table, Column, Row, Document, etc. which all leverage this idea that you can trivially refer to another thing by some identifier. For instance, Table.FirstRowId and Table.LastRowId could be used to refer to the logical extents of a collection of rows that might be linked together both ways (e.g. Row.PreviousRowId, Row.NextRowId).
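
As a rough illustration of that layering, here's a small Go sketch of rows linked by identifiers on top of a string-keyed KV interface. The KV, Table, and Row types and the key scheme are made up for the example, not taken from any particular database; a real engine would add encoding, indexes, and transactional updates on top of the same primitive.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// KV is the minimal primitive everything else is built on.
type KV interface {
	Get(key string) ([]byte, bool)
	Put(key string, value []byte)
}

// memKV is an in-memory stand-in for a real key-value store.
type memKV map[string][]byte

func (m memKV) Get(k string) ([]byte, bool) { v, ok := m[k]; return v, ok }
func (m memKV) Put(k string, v []byte)      { m[k] = v }

// Table and Row refer to each other purely by identifiers stored in the KV,
// forming a doubly linked list of rows.
type Table struct {
	Name                  string
	FirstRowID, LastRowID string
}

type Row struct {
	ID, PreviousRowID, NextRowID string
	Columns                      map[string]string
}

// appendRow links a new row after the table's current last row.
func appendRow(kv KV, t *Table, r Row) {
	if t.LastRowID != "" {
		var last Row
		data, _ := kv.Get("row/" + t.LastRowID)
		json.Unmarshal(data, &last)
		last.NextRowID = r.ID
		r.PreviousRowID = last.ID
		b, _ := json.Marshal(last)
		kv.Put("row/"+last.ID, b)
	} else {
		t.FirstRowID = r.ID
	}
	t.LastRowID = r.ID
	b, _ := json.Marshal(r)
	kv.Put("row/"+r.ID, b)
	tb, _ := json.Marshal(t)
	kv.Put("table/"+t.Name, tb)
}

func main() {
	kv := memKV{}
	t := &Table{Name: "metrics"}
	appendRow(kv, t, Row{ID: "1", Columns: map[string]string{"host": "a"}})
	appendRow(kv, t, Row{ID: "2", Columns: map[string]string{"host": "b"}})

	// Walk the rows via their identifiers, starting from the table header.
	for id := t.FirstRowID; id != ""; {
		data, _ := kv.Get("row/" + id)
		var r Row
		json.Unmarshal(data, &r)
		fmt.Println(r.ID, r.Columns)
		id = r.NextRowID
	}
}
```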


That's definitely a good approach. I'm fortunate enough that I can learn just by reading, so I like to just lay down and read coding/comp-sci stuff when I can.

Like, if I wanted to, I bet I could create the server, connection manager, query parser, optimizer, WAL, storage... and maybe I will, but it would be just as fine to read about someone else doing it.

I spend enough time coding as it is. :)


This would be basically doing all of the research and learning on databases from the last few decades from scratch. It probably won't give you a good understanding of how a modern DB works though.


The book referenced in the blog post seems interesting:

Database Internals: A Deep Dive into How Distributed Data Systems Work https://www.amazon.com/_/dp/1492040347


Database Design for Mere Mortals


this is nice... but what's missing is a discussion of sync frames when using delta encoding.

typically in video compression, every once in a while a full frame image is included in the stream (mpeg calls these "i-frames"). this allows for seeking and for restarting after lost/corrupted data, whereas pure deltas require a complete replay with no gaps or corruption to recover the values.

in the case of something like this, i could imagine sync frames looking like: a sentinel sequence for finding the sync frames, full value(s) for each delta-encoded variable, plus possibly mini skip lists for searching (although an external index could be used for this too).
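
to make that concrete, here's a rough go sketch of a toy delta encoder that emits a full-value sync frame every N points so a reader can resynchronize mid-stream; the frame layout and interval are made up purely for illustration.

```go
package main

import "fmt"

const syncInterval = 4 // emit a full value every N points (illustrative)

// frame is either a sync frame carrying the absolute value or a delta
// against the previous point. a real format would also include a sentinel
// byte sequence so a reader can scan for sync frames in a damaged stream.
type frame struct {
	sync  bool
	value int64 // absolute value if sync, delta otherwise
}

// encode turns absolute values (e.g. timestamps) into deltas,
// inserting a sync frame every syncInterval points.
func encode(values []int64) []frame {
	var out []frame
	var prev int64
	for i, v := range values {
		if i%syncInterval == 0 {
			out = append(out, frame{sync: true, value: v})
		} else {
			out = append(out, frame{value: v - prev})
		}
		prev = v
	}
	return out
}

// decodeFrom replays the stream starting at the nearest sync frame at or
// before index i, so recovery never needs the full history.
func decodeFrom(frames []frame, i int) []int64 {
	start := i
	for start > 0 && !frames[start].sync {
		start--
	}
	var out []int64
	var cur int64
	for _, f := range frames[start:] {
		if f.sync {
			cur = f.value
		} else {
			cur += f.value
		}
		out = append(out, cur)
	}
	return out
}

func main() {
	ts := []int64{1000, 1010, 1020, 1035, 1050, 1060, 1075}
	frames := encode(ts)
	// seek into the middle of the stream: decoding restarts at the
	// sync frame at index 4 instead of replaying from the beginning.
	fmt.Println(decodeFrom(frames, 5)) // [1050 1060 1075]
}
```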


It seems they are relying on the partitions to act as sync frames for this implementation.


true. and maybe that's my bias from cutting my teeth on video compression formats, but i still think it's an interesting discussion.

by explicitly including sync frames in the data storage format itself (rather than relying on the storage medium), a few really nice properties emerge:

1) the compression in the stream becomes independent of storage. it remains self-contained both in the database and out of it, when being sent over the network or stored in ram. mpeg can very nicely be used for both storage and transmission of data in its compressed format.

2) by decoupling it from storage, compression can also be applied to the data itself rather than just the timestamps. sure, the block/partition headers could be expanded to cover the data, but that gets messier when it comes to the addition and deletion of dimensions.

all of that said, for a simple tutorial on time series storage, this implementation is nice and simple.


Can someone ELI5: "Time series data tends to have a higher cardinality." ... "In that sense, there are few kinds of metrics that are exactly the same; creating a file for each metric will cause various kinds of problems such as inode limitations."


There are lots of individual data points in time-series data, so having a high file-to-data-point ratio would cause issues, including running out of inodes (the identifying handle of a file at the OS level).



