TrailDB – An Efficient Library for Storing and Processing Event Data (adroll.com)
212 points by vtuulos on May 24, 2016 | 54 comments



It looks like this must be the successor to Deliroll, which was described in an absolutely fantastic talk from 2013[0]. I've been hoping this would be open sourced for years. Really glad to see this!

[0] https://www.youtube.com/watch?v=rXj5nayS7Yg


I was just watching this[0] excellent talk from one of the Heap Analytics engineers on their use of CitusDB + Postgres to do event processing.

For what use cases would TrailDB be the obvious, hands-down choice (vs. maybe Postgres + Citus)?

[0]: https://www.youtube.com/watch?v=NVl9_6J1G60


I'm so glad AdRoll finally open-sourced this. TrailDB is by far the easiest way to process trillions of events on a single machine. Can't wait for them to start publishing their internal ecosystem around it too.


Thanks! I hope we can open-source the rest of the stack now that the core libraries are out.

Btw, I am happy to answer any questions about the project here.


It looks like TrailDB solves a particular problem (which is actually quite common, btw) the right way. However, the language barrier makes it hard to use as a DB. I believe the Python binding is nice for a development environment, but in production we would probably want a low-level language for performance reasons, because we'll be dealing with billions of events on a single machine. Do you plan to implement a query language, such as SQL, that compiles to native code?

How do you handle continuous data with TrailDB? It seems that you store raw events in Kinesis, buffer them, and periodically write TrailDB files to S3. When you want to process events for a specific user, the events might be scattered across arbitrary TrailDB files, so the timeline would be mixed up. Do you merge TrailDB files on a single instance when processing the data, or do you have a sharding mechanism?


We have an internal framework that compiles a state-machine DSL into lower-level code to execute queries on TrailDBs. However, the Python binding is fairly thin, and the same is true of the other ones, like the Go binding. So before optimizing with a lower-level language or a special DSL, I would verify that Python or one of the other supported languages doesn't already satisfy your requirements.
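For a sense of how thin the binding is, here is a minimal sketch of scanning trails with the Python binding, roughly following the tutorial on traildb.io (the file name and the 'action' field are hypothetical):

    from traildb import TrailDB

    tdb = TrailDB('pixel-log.tdb')  # hypothetical TrailDB file

    for uuid, trail in tdb.trails():
        for event in trail:
            # each event carries a timestamp plus one value per field;
            # 'action' is an assumed field name for this example
            if event.action == 'conversion':
                print('%s %s' % (uuid, event.time))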

Continuous data is handled by sharding TrailDBs across some fields in our log lines, so that all the related events for a cookie in a given day land in the same shard. The shard mapping is the same each day, so we can download the same shard IDs from S3 and process the files sequentially using our DSL. With a bit of code you can automate this whole process of downloading from S3 and processing; this is in fact what we do in our data pipeline[0][1] - see the sketch after the links.

[0]: http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-d...

[1]: http://tech.adroll.com/blog/data/2015/10/15/luigi.html
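To make the sharding idea concrete, here's a rough sketch of a stable shard mapping (the shard count, bucket name, and path layout are hypothetical, not AdRoll's actual setup):

    import hashlib

    N_SHARDS = 64  # hypothetical shard count

    def shard_id(cookie):
        # stable hash: the same cookie always maps to the same shard,
        # so all of a user's events for a day end up in one TrailDB file
        return int(hashlib.md5(cookie.encode()).hexdigest(), 16) % N_SHARDS

    def shard_key(day, cookie):
        # hypothetical S3 layout: one TrailDB file per shard per day
        return 's3://events-bucket/%s/shard-%02d.tdb' % (day, shard_id(cookie))

    print(shard_key('2016-05-24', 'cookie-abc123'))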


It sounds like TrailDB works particularly well for storing data in S3, so what do you use to buffer incoming data before it is written to TrailDB? Kinesis?


Our infrastructure is in AWS, so S3 was a natural choice for us. You can easily use any other object store or filesystem with TrailDB.

We use Kinesis amongst other things to stream raw data to S3.


An easily embeddable C lib. Fantastic choice. If only more projects in this space did this. So much stuff in this area is Java.


Could not agree more, biokoda. A C lib with all the bindings offered with TrailDB is absolutely a winner. C, Python, and Go. A happy day for sure.

Thank you so much Ville and AdRoll for sharing this with the world! Your talk on the Trillion Row Data Warehouse in Python was an inspiration for much work I have done since. This library will carry that even further.


Indeed. AdRoll seems to be one of those few places not totally blinded by the JVM. The strong presence of D and Python in their stack seems hard to resist. Sadly, they are not in India (yet?).


Thanks for your kind words!


And in Java 9, this embeddable C lib will be usable from within Java given http://openjdk.java.net/projects/panama/ :-D Not that JNR can't do this now, but Panama will be awesome. /end hijack


When will people realize you should never call a project Panama? ;-)


Are SQL window functions really so bad? Setting up a completely separate system for a particular kind of query seems like the wrong approach.


The approach here is very different. It's not that window functions are bad, but that a traditional SQL database, or even a column-oriented database, doesn't have a disk format suited to this type of analysis at the volumes AdRoll deals with. TrailDB can filter through tens of millions of events per second on a MacBook, and just storing billions of events in a more traditional DB or warehouse would use far more space than TrailDBs do. On top of that, it's relatively easy to build visualisations with an interactive UI on top of these databases for exploration. So the complexity of window functions was perhaps just the spark, but there are real advantages in speed, efficiency, and expressiveness in this type of database.


It's not really about SQL or NoSQL - I believe AdRoll internally has systems that map SQL queries onto TrailDB where it's needed.

It's about efficiency. TrailDB compresses the hell out of event streams grouped by some key (typically a user) - I think it's generally something like 100x - and it does so by making certain assumptions about your data. Similarly, many operations on your event sequences can be performed as integer instructions on the compressed data.
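A toy illustration of why the grouping helps (this is not TrailDB's actual encoding, just the general idea): once events are sorted per user, timestamps and repeated field values become highly predictable, so they reduce to small integers that compress well.

    # Toy delta-encoding of one user's sorted event timestamps.
    # TrailDB's real format is more sophisticated, but the principle
    # is the same: predictable sequences shrink to small integers.
    timestamps = [1464048000, 1464048007, 1464048009, 1464048009]
    deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
    print(deltas)  # [1464048000, 7, 2, 0] -- mostly tiny values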

TrailDB gives you order(s) of magnitude of leverage. In many cases that's the difference between doing analytics on a single machine and on a cluster. There aren't too many other databases that were designed to gracefully handle tens of trillions of events, so it's really not surprising that you need a different tool for that job.


Efficient data structures and algos, carefully tuned for one particular scenario. Seems like a good candidate for one of the PostgreSQL pluggable storage engines we will likely see in the future[1]. I hope the data structures are not patented.

[1] http://akorotkov.github.io/blog/2016/04/06/extensible-access...


How does this compare to PipelineDB? I'm pretty sure the cofounders of Pipeline are ex-AdRollers, so there should be a lot of overlap since it was made with the same problem in mind.


It's not quite the same thing: PipelineDB is a database for executing streaming queries at massive volumes. That is definitely one of the challenges at AdRoll.

TrailDB is more about a different way of grouping events, and granularly querying and analysing each trail of data. TrailDBs are materialized and typically stored in S3.


I'd also add that PipelineDB excels at SQL-based workloads where you know the queries you want to run in advance: you stream large volumes of data into PipelineDB's continuous query engine and store ONLY the results of the continuous queries in its underlying relational database. The main value is continuous computation and distillation for realtime reporting, monitoring, and alerting use cases.

PipelineDB isn't designed for ad hoc, exploratory queries, although it can handle them to the same extent that PostgreSQL 9.5 can. It's designed for scenarios where you know the analytic queries in advance, and where a SQL-based approach to streaming analytics with integrated storage is simpler than building a custom system in Java / Scala using frameworks like Storm, Spark Streaming, Druid, Cassandra / HBase, etc.


PipelineDB is a great product and very complementary to TrailDB.

PipelineDB is all about querying streaming, real-time data in SQL whereas TrailDB is great for computationally intensive analysis of historical data using any programming language.


I'm assuming this is primarily used for analytics? Metamarkets, another ad company, open-sourced Druid for a similar purpose. I'm curious about the differences between these solutions.


AFAIK Druid is for time series and it's a columnar database. Their format has dimensions and then pre-aggregated fields on those dimensions. Afterwards you can run SQL queries on that data format to get full aggregations in return.

This is a library to read and write a data format that is optimized to give access to granular events and actors within an event stream. For example this could be used to trail all of the events generated by one entity (credit card, cookie, email, account and so on) over a dataset. At that point you can choose what you want to do with it: extract features for ML, train ML directly on raw data, run arbitrary queries for outliers and anomaly detection and what have you.
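For example, here's a sketch of per-trail feature extraction with the Python binding (the file name and the 'action'/'purchase' field values are hypothetical):

    from collections import Counter
    from traildb import TrailDB

    tdb = TrailDB('events.tdb')  # hypothetical TrailDB file

    def features(trail):
        # compute a few simple features over one entity's event trail
        events = list(trail)
        actions = Counter(e.action for e in events)  # 'action' field assumed
        return {
            'n_events': len(events),
            'n_purchases': actions['purchase'],
            'timespan': events[-1].time - events[0].time,
        }

    X = [features(trail) for uuid, trail in tdb.trails()]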


Might have to check these out. I'd be interested in the math of storing data like this.

Also, another event DB project: https://github.com/benbjohnson/skydb.io/blob/master/source/b...


Interesting, but the open source SkyDB seems to be dead:

https://twitter.com/benbjohnson/status/590626780173721601


Actually, just reading:

> So I understand, by "pure go" you mean you've thrown out the lua, julia and ruby bits?

wow, that was quite the polyglot codebase...


Yeah :-<, although maybe it's a good starting point for anyone interested in the code. I kind of hope it takes off again.


Very interesting. I wonder how this compares to writing events to a Parquet (http://parquet.io/) or Avro (or both) file - that way, all manner of distributed tools like Hive or Spark could already read it.


TrailDB is optimized for a very specific data model (http://traildb.io/docs/technical_overview/#data-model) which allows it to do compression that you couldn't do with other data layouts.

TrailDB is a C library, Parquet and Avro are not. Depending on your use case this might be a pro or a con.
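To illustrate the data model, here's a minimal construction example, roughly following the tutorial at traildb.io (the field names and UUID are just placeholders):

    from datetime import datetime
    from traildb import TrailDBConstructor

    # Declare the fields up front; every event is a UUID, a timestamp,
    # and one value per field. Field values are dictionary-encoded,
    # which is part of what enables the trail-level compression.
    cons = TrailDBConstructor('example', ['username', 'action'])
    uuid = '12345678123456781234567812345678'  # 32-char hex UUID
    cons.add(uuid, datetime(2016, 5, 24, 10, 0), ('alice', 'login'))
    cons.add(uuid, datetime(2016, 5, 24, 10, 5), ('alice', 'click'))
    tdb = cons.finalize()  # writes example.tdb, returns a readable TrailDB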


Parquet is a columnar format, so it's optimized for aggregations over columns filtered by a set of dimensions. TrailDB's data format is optimized for discrete event analysis without aggregation: data points are granularly grouped by their source (for example, a user account and actions on a website), and your queries operate on each of these groups independently. No aggregation happens in TrailDB.


At which point in the curve of users in your app does it become better to use aggregate queries like this instead of just looking at each user individually?


This is actually a tool to look at each user individually, not in aggregate.



That's just what the example script does. In the database all the data is stored granularly so you can, and we (AdRoll) do, use TrailDB to examine individual trails one by one after you select the subset that you are interested in.


This shows incredible promise for being used as the basis for an Event Store type database system.


Looks very promising. Someone really should build a distributed server on top of this.


How does this compare to something like InfluxDB and other time-series databases?


TrailDB is designed for discrete events (think JSON objects) rather than continuous metrics (like CPU utilization).

Most time-series databases manage to store a large number of data points by aggregating data over time. Discrete events can't be aggregated that way. Instead, TrailDB leverages the predictability of events to compress data over time.


I've not yet used any of these, but I think InfluxDB stores discrete events. In fact, your description sounds very similar to this comparison of Prometheus and InfluxDB[0]:

> InfluxDB is better geared towards the following use cases: Storing all individual events, not just time series of values. E.g. storing every HTTP request with full metadata vs. storing the cumulative count of HTTP requests for certain dimensions.

[0] https://prometheus.io/docs/introduction/comparison/#promethe...


Yes, schema wise that's pretty close.

However, last time I looked that didn't seem like a use case InfluxDB invested a ton of effort optimizing for, especially if you tried to store a lot of separate "event histories" either as separate measurements or by using tags. By "a lot" I mean 100K-100M histories. It may have improved since.

Besides, there is a somewhat fundamental tradeoff between allowing efficient realtime granular writes, which I believe is a priority for InfluxDB, and building an efficient indexed store for millions of histories/event trails, which is more of a TrailDB use case. Kind of OLAP vs OLTP.


Any plans for a NodeJS client?


Played around with it today. Made some progress here: https://github.com/poynt/traildb-node

Let me know if you have any use cases that are in the API but not covered yet by this library :)


We were just talking about it earlier today. It should be easy to do with https://github.com/node-ffi/node-ffi

Contributions are welcome :)


How important is Judy in achieving the level of performance you reached?


Good question. Judy arrays tend to be really efficient for low-level mappings like the 128-bit -> 64-bit map (https://github.com/traildb/traildb/blob/master/src/judy_128_...) that we use to keep track of UUIDs.

It would be interesting to benchmark Judy against another similar well-optimized data structure.


If you are considering this, I'd check out druid.io.


Nicely done!


The traildb-r link to GitHub is a 404. Checked the account - no repo for R :(


It is coming soon, sorry for the delay!


You can find the R bindings now at https://github.com/traildb/traildb-r


Oh you really meant soon. Great, thanks!


What use cases are suitable for this setup other than obnoxious ad-tech?


Well, let's see. Perhaps: processing social media data, analysis of stock market data, looking at weather or other observational data, working with log events, handling IoT data streams... and on, and on. If you look at the world, an amazing number of data use cases reduce to events, by time, by some other dimension.



