It looks like this must be the successor to Deliroll, which was described in an absolutely fantastic talk from 2013[0]. I've been hoping this would be open sourced for years. Really glad to see this!
I'm so glad AdRoll finally open-sourced this. TrailDB is by far the easiest way to process trillions of events on a single machine. Can't wait for them to start publishing their internal ecosystem around it too.
It looks like TrailDB solves a particular problem (which is actually quite common, btw) the right way. However, the language barrier makes it hard to use as a DB. The Python binding is nice for a development environment, but in production we would probably want to use a low-level language for performance reasons, because we'll be dealing with billions of events on a single machine. Do you plan to implement a query language, such as SQL, that compiles to native code?
How do you handle continuous data with TrailDB? It seems that you store raw events in Kinesis, buffer them, and periodically write TrailDB files to S3. When you want to process events for a specific user, the events might end up scattered across TrailDB files, so the timeline would be mixed up. Do you merge TrailDB files on a single instance when processing the data, or do you have a sharding mechanism?
We have an internal framework that compiles a state machine DSL into lower-level code to execute queries on TrailDBs. However, the Python binding is fairly thin, and the same can be said for the other ones, like the Go one. So before optimizing with a lower-level language or a special DSL, I would first verify that Python (or another of the supported languages) doesn't already satisfy your requirements.
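To give a sense of how thin the binding is, here is a minimal sketch of a query in Python. The file name and the 'type' field are made up, and the calls reflect my reading of the Python binding's docs, so treat it as illustrative rather than definitive:

    # Minimal sketch: open a TrailDB and count 'click' events per trail.
    # 'events.tdb' and the 'type' field are hypothetical.
    from traildb import TrailDB

    tdb = TrailDB('events.tdb')
    for uuid, trail in tdb.trails():
        # A trail is the time-ordered sequence of events for one actor.
        clicks = sum(1 for event in trail if event.type == 'click')
        print(uuid, clicks)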
And continuous data is handled by sharding TrailDBs across some fields in our log lines, so that all the related events for a cookie in a given day belong to the same shard. The shard mapping is the same every day, so we can just download the same shard ids from S3 and process the files sequentially using our DSL. With a bit of code you can make this whole process of downloading from S3 and processing completely automated (a sketch of that loop is below); this is in fact what we do with our data pipeline[0][1].
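Purely as an illustration of that loop: the bucket name, key layout and shard count here are invented, and only the idea of a stable cookie-to-shard mapping comes from the description above.

    # Hypothetical sketch of the daily shard download-and-process loop.
    import zlib
    import boto3
    from traildb import TrailDB

    NUM_SHARDS = 64
    s3 = boto3.client('s3')

    def shard_id(cookie):
        # Stable mapping: the same cookie always lands in the same shard.
        return zlib.crc32(cookie.encode('utf-8')) % NUM_SHARDS

    def process_day(day, shard):
        key = 'trails/%s/shard-%03d.tdb' % (day, shard)
        local = '/tmp/shard-%03d.tdb' % shard
        s3.download_file('my-traildb-bucket', key, local)
        tdb = TrailDB(local)
        for uuid, trail in tdb.trails():
            pass  # run your per-trail query / state machine logic here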
It sounds like TrailDB works particularly well with storing data in S3, so what do you use to buffer incoming data before it is written to TrailDB files? Kinesis?
Could not agree more, biokoda. A C lib with all the bindings offered with TrailDB is absolutely a winner. C, Python, and Go. A happy day for sure.
Thank you so much Ville and AdRoll for sharing this with the world! Your talk on the Trillion Row Data Warehouse in Python was an inspiration for much work I have done since. This library will carry that even further.
Indeed. AdRoll seems to be one of those few places not totally blinded by the JVM. The strong presence of D and Python in their stack seems hard to resist. Sadly, they are not in India (yet?).
And in Java 9, this embeddable C lib will be usable from within Java given http://openjdk.java.net/projects/panama/ :-D Not that JNR can't do this now, but Panama will be awesome. /end hijack
The approach here is very different. It's not that window functions are bad, but that the traditional SQL database, or even a column-oriented database, doesn't have a disk format suited to this type of analysis at the volumes that AdRoll handles. TrailDB can filter through tens of millions of events per second on a MacBook, and just storing billions of events in a more traditional DB or warehouse would use far more space than TrailDBs use. On top of that, it's relatively easy to build visualisations with an interactive UI on top of these databases to do exploration. So the window function being complicated was perhaps just the spark, but there are real advantages in speed, efficiency and expressiveness in this type of database.
It's not really about SQL or NoSQL - I believe AdRoll internally has systems that map SQL queries onto TrailDB where it's needed.
It's about efficiency. TrailDB compresses the hell out of event streams grouped by some key (typically user). I think it's generally something like 100x - it does so by making certain assumptions about your data. Similarly, when operating on your event sequences, many of those operations can be performed as integer instructions on the compressed data. (A rough sketch of what that grouping looks like on the write side is below.)
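For illustration, this is roughly what writing a TrailDB looks like from the Python binding; the field names, timestamps and output path are made up, and the calls reflect my understanding of the binding's docs:

    # Rough sketch of building a TrailDB: events are added per actor UUID,
    # and TrailDB groups and compresses them by trail when finalized.
    from uuid import uuid4
    from traildb import TrailDBConstructor

    cons = TrailDBConstructor('example', ['action', 'page'])
    user = uuid4().hex  # 32-char hex id for one trail (e.g. a cookie)
    cons.add(user, 1468900000, ['view', 'landing'])
    cons.add(user, 1468900060, ['click', 'signup'])
    tdb = cons.finalize()  # writes example.tdb and returns a readable handle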
TrailDB gives you order(s) of magnitude of leverage. That's the difference between doing analytics on a single machine and on a cluster in many cases. There aren't too many other databases that were designed to gracefully handle tens of trillions of events, so it's really not surprising that you need a different tool for that job.
Efficient data structures and algos, carefully tuned for one particular scenario.
Seems like a good candidate for one of the PostgreSQL pluggable storage engines we will likely see in the future[1].
I hope the data structures are not patented.
How does this compare to PipelineDB? I'm pretty sure the cofounders of Pipeline are ex-AdRollers, so there should be a lot of overlap since it was made with the same problem in mind.
It's not quite the same thing: PipelineDB is a database for executing streaming queries at massive volumes. That is definitely one of the challenges at AdRoll.
TrailDB is more about a different way of grouping events, and granularly querying and analysing each trail of data. TrailDBs are typically materialized and stored in S3.
I'd also add that PipelineDB excels at SQL-based workloads where you know the queries you want to run in advance and want to stream large volumes of data into PipelineDB's continuous query engine and store ONLY the results of the continuous queries in PipelineDB's underlying relational database. The main value is in continuous computation and distillation for realtime reporting and realtime monitoring and alerting use cases. PipelineDB isn't designed to do ad hoc, exploratory queries, although it can to the same extent that PostgreSQL 9.5 can. It's designed for scenarios where you know the analytic queries you want to run in advance and where a SQL-based approach to streaming analytics with integrated storage provides value by being simpler than building a custom system in Java / Scala using frameworks like Storm, Spark Streaming, Druid, Cassandra / HBase, etc.
Pipelinedb.com is a great product and very complementary to TrailDB.
PipelineDB is all about querying streaming, real-time data in SQL whereas TrailDB is great for computationally intensive analysis of historical data using any programming language.
I'm assuming this is primarily used for analytics? Metamarkets, another ad company, open sourced Druid for a similar purpose. I'm curious as to the differences between these solutions.
AFAIK Druid is a columnar database for time series. Their format has dimensions and then pre-aggregated fields on those dimensions. Afterwards you can run SQL queries on that data format to get full aggregations in return.
This is a library to read and write a data format that is optimized to give access to granular events and actors within an event stream. For example this could be used to trail all of the events generated by one entity (credit card, cookie, email, account and so on) over a dataset. At that point you can choose what you want to do with it: extract features for ML, train ML directly on raw data, run arbitrary queries for outliers and anomaly detection and what have you.
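As a concrete (and hypothetical) sketch of that kind of per-entity analysis, a feature like time-to-first-purchase could be pulled out of each trail with a few lines of the Python binding; the file and field names below are invented:

    # Hypothetical sketch: compute time from first event to first 'purchase'
    # for every entity. 'events.tdb' and the 'type' field are made up.
    from traildb import TrailDB

    tdb = TrailDB('events.tdb')
    features = {}
    for uuid, trail in tdb.trails():
        events = list(trail)
        first_ts = events[0].time
        purchase_ts = next((e.time for e in events if e.type == 'purchase'), None)
        features[uuid] = purchase_ts - first_ts if purchase_ts else None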
Very interesting. I wonder how this compares to writing events to a Parquet (http://parquet.io/) or Avro (or both) file - this way all manners of distributed tools like Hive or Spark could already read it.
Parquet is a columnar format, so it's optimized for aggregations over columns filtered by a set of dimensions. The TrailDB data format is optimized for discrete event analysis without aggregation: data points are granularly grouped by their source (for example a user account and its actions on a website) and your queries operate on each of these groups independently. No aggregation happens in TrailDB.
At which point in the curve of users-in-your-app does it become better to use aggregate queries like this instead of just looking at each user individually?
That's just what the example script does. In the database all the data is stored granularly so you can, and we (AdRoll) do, use TrailDB to examine individual trails one by one after you select the subset that you are interested in.
TrailDB is designed for discrete events (think JSON objects) rather than continuous metrics (like CPU utilization).
Most time-series databases manage to store a large number of data points by aggregating data over time. Discrete events can't be aggregated. Instead, TrailDB leverages predictability of events to compress data over time.
I've not yet used any of these, but I think InfluxDB stores discrete events. In fact, your description sounds very similar to this comparison of Prometheus and InfluxDB[0]:
> InfluxDB is better geared towards the following use cases: Storing all individual events, not just time series of values. E.g. storing every HTTP request with full metadata vs. storing the cumulative count of HTTP requests for certain dimensions.
However, last time I looked that didn't seem like a use case InfluxDB invested a ton of effort optimizing for, especially if you tried to store a lot of separate "event histories" either as separate measurements or by using tags. By "a lot" I mean 100K-100M histories. It may have improved since.
Besides, there is a somewhat fundamental tradeoff between allowing efficient realtime granular writes, which I believe is a priority for InfluxDB, and building an efficient indexed store for millions of histories/event trails, which is more of a TrailDB use case. Kind of OLAP vs OLTP.
Well let's see. Perhaps, processing social media data, analysis of stock market data, looking at weather or other observational data, working with log events, handling IoT data streams, ... and on, and on. If you look at the world, an amazing amount of data use cases reduce to events by time by some other dimension.
[0] https://www.youtube.com/watch?v=rXj5nayS7Yg