It looks like this must be the successor to Deliroll, which was described in an absolutely fantastic talk from 2013[0]. I've been hoping this would be open sourced for years. Really glad to see this!
I'm so glad AdRoll finally open-sourced this. TrailDB is by far the easiest way to process trillions of events on a single machine. Can't wait for them to start publishing their internal ecosystem around it too.
It looks like TrailDB solves a particular problem (which is actually quite common, btw) the right way. However, the language barrier makes it hard to use as a DB. The Python binding is nice for a development environment, but in production we would probably want to use a low-level language for performance reasons, because we'll be dealing with billions of events on a single machine. Do you plan to implement a query language, such as SQL, that compiles to native code?
How do you handle continuous data with TrailDB? It seems that you store raw events in Kinesis, buffer them, and periodically write TrailDB files to S3. When you want to process events for a specific user, the events might end up scattered across TrailDB files, so the timeline would be mixed up. Do you merge TrailDB files on a single instance when processing the data, or do you have a sharding mechanism?
We have an internal framework that compiles a state machine DSL into lower-level code to execute queries on TrailDBs. However, the Python binding is fairly thin, and the same can be said for the other ones, like the Go one. So before optimizing with a lower-level language or a special DSL, I would first verify that Python (or another of the supported languages) doesn't already satisfy your requirements.
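To give a sense of how thin the binding is, here is a minimal sketch of a query in Python. The file name and the 'type' field are made up, and the calls reflect my reading of the Python binding's docs, so treat it as illustrative rather than definitive:

    # Minimal sketch: open a TrailDB and count 'click' events per trail.
    # 'events.tdb' and the 'type' field are hypothetical.
    from traildb import TrailDB

    tdb = TrailDB('events.tdb')
    for uuid, trail in tdb.trails():
        # A trail is the time-ordered sequence of events for one actor.
        clicks = sum(1 for event in trail if event.type == 'click')
        print(uuid, clicks)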
And continuous data is handled by sharding TrailDBs across some fields in our log lines, so that all the related events for a cookie in a given day belong to the same shard. The shard mapping is the same every day, so we can just download the same shard ids from S3 and process the files sequentially using our DSL. With a bit of code you can make this whole process of downloading from S3 and processing completely automated (a sketch of that loop is below); this is in fact what we do with our data pipeline[0][1].
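Purely as an illustration of that loop: the bucket name, key layout and shard count here are invented, and only the idea of a stable cookie-to-shard mapping comes from the description above.

    # Hypothetical sketch of the daily shard download-and-process loop.
    import zlib
    import boto3
    from traildb import TrailDB

    NUM_SHARDS = 64
    s3 = boto3.client('s3')

    def shard_id(cookie):
        # Stable mapping: the same cookie always lands in the same shard.
        return zlib.crc32(cookie.encode('utf-8')) % NUM_SHARDS

    def process_day(day, shard):
        key = 'trails/%s/shard-%03d.tdb' % (day, shard)
        local = '/tmp/shard-%03d.tdb' % shard
        s3.download_file('my-traildb-bucket', key, local)
        tdb = TrailDB(local)
        for uuid, trail in tdb.trails():
            pass  # run your per-trail query / state machine logic here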
It sounds like TrailDB works particularly well with storing data in S3, so what do you use to buffer incoming data before it is written to TrailDB files? Kinesis?
Could not agree more, biokoda. A C lib with all the bindings offered with TrailDB is absolutely a winner. C, Python, and Go. A happy day for sure.
Thank you so much Ville and AdRoll for sharing this with the world! Your talk on the Trillion Row Data Warehouse in Python was an inspiration for much work I have done since. This library will carry that even further.
Indeed. AdRoll seems to be one of those few places not totally blinded by the JVM. The strong presence of D and Python in their stack seems hard to resist. Sadly, they are not in India (yet?).
And in Java 9, this embeddable C lib will be usable from within Java given http://openjdk.java.net/projects/panama/ :-D Not that JNR can't do this now, but Panama will be awesome. /end hijack
The approach here is very different. It's not that window functions are bad, but that the traditional SQL database, or even a column-oriented database, doesn't have a disk format suited to this type of analysis at the volumes that AdRoll handles. TrailDB can filter through tens of millions of events per second on a MacBook, and just storing billions of events in a more traditional DB or warehouse would use far more space than TrailDBs use. On top of that, it's relatively easy to build visualisations with an interactive UI on top of these databases to do exploration. So the window function being complicated was perhaps just the spark, but there are real advantages in speed, efficiency and expressiveness in this type of database.
It's not really about SQL or NoSQL - I believe AdRoll internally has systems that map SQL queries onto TrailDB where it's needed.
It's about efficiency. TrailDB compresses the hell out of event streams grouped by some key (typically user). I think it's generally something like 100x - it does so by making certain assumptions about your data. Similarly, when operating on your event sequences, many of those operations can be performed as integer instructions on the compressed data. (A rough sketch of what that grouping looks like on the write side is below.)
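For illustration, this is roughly what writing a TrailDB looks like from the Python binding; the field names, timestamps and output path are made up, and the calls reflect my understanding of the binding's docs:

    # Rough sketch of building a TrailDB: events are added per actor UUID,
    # and TrailDB groups and compresses them by trail when finalized.
    from uuid import uuid4
    from traildb import TrailDBConstructor

    cons = TrailDBConstructor('example', ['action', 'page'])
    user = uuid4().hex  # 32-char hex id for one trail (e.g. a cookie)
    cons.add(user, 1468900000, ['view', 'landing'])
    cons.add(user, 1468900060, ['click', 'signup'])
    tdb = cons.finalize()  # writes example.tdb and returns a readable handle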
TrailDB gives you order(s) of magnitude of leverage. That's the difference between doing analytics on a single machine and on a cluster in many cases. There aren't too many other databases that were designed to gracefully handle tens of trillions of events, so it's really not surprising that you need a different tool for that job.
Efficient data structures and algos, carefully tuned for one particular scenario.
Seems like a good candidate for one of the PostgreSQL pluggable storage engines we will likely see in the future[1].
I hope the data structures are not patented.
How does this compare to PipelineDB? I'm pretty sure the cofounders of Pipeline are ex-AdRollers, so there should be a lot of overlap since it was made with the same problem in mind.
It's not quite the same thing: PipelineDB is a database for executing streaming queries at massive volumes. That is definitely one of the challenges at AdRoll.
TrailDB is more about a different way of grouping events, and granularly querying and analysing each trail of data. TrailDBs are typically materialized and stored in S3.
I'd also add that PipelineDB excels at SQL-based workloads where you know the queries you want to run in advance and want to stream large volumes of data into PipelineDB's continuous query engine and store ONLY the results of the continuous queries in PipelineDB's underlying relational database. The main value is in continuous computation and distillation for realtime reporting and realtime monitoring and alerting use cases. PipelineDB isn't designed to do ad hoc, exploratory queries, although it can to the same extent that PostgreSQL 9.5 can. It's designed for scenarios where you know the analytic queries you want to run in advance and where a SQL-based approach to streaming analytics with integrated storage provides value by being simpler than building a custom system in Java / Scala using frameworks like Storm, Spark Streaming, Druid, Cassandra / HBase, etc.
Pipelinedb.com is a great product and very complementary to TrailDB.
PipelineDB is all about querying streaming, real-time data in SQL whereas TrailDB is great for computationally intensive analysis of historical data using any programming language.
I'm assuming this is primarily used for analytics? Metamarkets, another ad company, open sourced Druid for a similar purpose. I'm curious as to the differences between these solutions.
AFAIK Druid is a columnar database for time series. Their format has dimensions and then pre-aggregated fields on those dimensions. Afterwards you can run SQL queries on that data format to get full aggregations in return.
This is a library to read and write a data format that is optimized to give access to granular events and actors within an event stream. For example this could be used to trail all of the events generated by one entity (credit card, cookie, email, account and so on) over a dataset. At that point you can choose what you want to do with it: extract features for ML, train ML directly on raw data, run arbitrary queries for outliers and anomaly detection and what have you.
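As a concrete (and hypothetical) sketch of that kind of per-entity analysis, a feature like time-to-first-purchase could be pulled out of each trail with a few lines of the Python binding; the file and field names below are invented:

    # Hypothetical sketch: compute time from first event to first 'purchase'
    # for every entity. 'events.tdb' and the 'type' field are made up.
    from traildb import TrailDB

    tdb = TrailDB('events.tdb')
    features = {}
    for uuid, trail in tdb.trails():
        events = list(trail)
        first_ts = events[0].time
        purchase_ts = next((e.time for e in events if e.type == 'purchase'), None)
        features[uuid] = purchase_ts - first_ts if purchase_ts else None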
Very interesting. I wonder how this compares to writing events to a Parquet (http://parquet.io/) or Avro (or both) file - this way all manners of distributed tools like Hive or Spark could already read it.
Parquet is a columnar format, so it's optimized for aggregations over columns filtered by a set of dimensions. The TrailDB data format is optimized for discrete event analysis without aggregation: data points are granularly grouped by their source (for example a user account and its actions on a website) and your queries operate on each of these groups independently. No aggregation happens in TrailDB.
At which point in the curve of users-in-your-app does it become better to use aggregate queries like this instead of just looking at each user individually?
That's just what the example script does. In the database all the data is stored granularly so you can, and we (AdRoll) do, use TrailDB to examine individual trails one by one after you select the subset that you are interested in.
TrailDB is designed for discrete events (think JSON objects) rather than continuous metrics (like CPU utilization).
Most time-series databases manage to store a large number of data points by aggregating data over time. Discrete events can't be aggregated. Instead, TrailDB leverages predictability of events to compress data over time.
I've not yet used any of these, but I think InfluxDB stores discrete events. In fact, your description sounds very similar to this comparison of Prometheus and InfluxDB[0]:
> InfluxDB is better geared towards the following use cases: Storing all individual events, not just time series of values. E.g. storing every HTTP request with full metadata vs. storing the cumulative count of HTTP requests for certain dimensions.
However, last time I looked that didn't seem like a use case InfluxDB invested a ton of effort optimizing for, especially if you tried to store a lot of separate "event histories" either as separate measurements or by using tags. By "a lot" I mean 100K-100M histories. It may have improved since.
Besides, there is a somewhat fundamental tradeoff between allowing efficient realtime granular writes, which I believe is a priority for InfluxDB, and building an efficient indexed store for millions of histories/event trails, which is more of a TrailDB use case. Kind of OLAP vs OLTP.
Well let's see. Perhaps, processing social media data, analysis of stock market data, looking at weather or other observational data, working with log events, handling IoT data streams, ... and on, and on. If you look at the world, an amazing amount of data use cases reduce to events by time by some other dimension.
[0] https://www.youtube.com/watch?v=rXj5nayS7Yg