Seriesly - a document-oriented time-series database written in Go (dustin.github.com)
142 points by mgrouchy on Sept 11, 2012 | 34 comments



BTW, my demo site is running in my bedroom on a slow ARM5-based debian box over my terribly slow DSL. If things get slow, that's why.

The web server itself is a homemade HTTP server I wrote in Go and jokingly called "nging". Doing a lot of SSI and transfer compression can be a bit much for that machine, but it's still easier than maintaining my nginx config across upgrades.
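
For the curious, the transfer-compression half of it is roughly this shape (a minimal sketch, not the actual nging code; it just gzips responses from a stock file server):

    package main

    import (
        "compress/gzip"
        "log"
        "net/http"
        "strings"
    )

    // gzipWriter sends the response body through a gzip.Writer.
    type gzipWriter struct {
        http.ResponseWriter
        gz *gzip.Writer
    }

    func (w gzipWriter) Write(b []byte) (int, error) { return w.gz.Write(b) }

    func (w gzipWriter) WriteHeader(code int) {
        w.Header().Del("Content-Length") // length changes once compressed
        w.ResponseWriter.WriteHeader(code)
    }

    // withGzip compresses responses for clients that advertise gzip support.
    func withGzip(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
                next.ServeHTTP(w, r)
                return
            }
            w.Header().Set("Content-Encoding", "gzip")
            gz := gzip.NewWriter(w)
            defer gz.Close()
            next.ServeHTTP(gzipWriter{ResponseWriter: w, gz: gz}, r)
        })
    }

    func main() {
        http.Handle("/", withGzip(http.FileServer(http.Dir("./public"))))
        log.Fatal(http.ListenAndServe(":8080", nil))
    }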

I happened to have DB logs up, saw some queries that weren't me, and traced them back here. Good morning, HN.


If I may ask, why Go? As an excuse to exercise your Go skills, or did you pick it for some specific reason? (In other words, were you doing this for Go, or doing Go for this?)

All I saw on the entry was a blurb about its concurrency features.


What else would I use? I've been writing tons of Go code for nearly three years now. I get fast, concurrent, parallel code in very little time.

I was pretty much production-ready with seriesly in two weeks by myself (though I'm starting to get contributions from other users). Last night, I closed my last open issue "bulk interface" by making an optional memcached binary protocol interface with custom packets for database selection and streaming data in.
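
For the curious, the binary protocol is pretty friendly to that kind of extension: every request is a fixed 24-byte header followed by the key and body, so custom packets are cheap. A rough Go sketch of building one (the opcode and payload here are placeholders, not the actual custom commands):

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
    )

    // 24-byte memcached binary protocol request header.
    type mcReqHeader struct {
        Magic    uint8  // 0x80 = request
        Opcode   uint8  // command byte; custom opcodes slot in here
        KeyLen   uint16
        ExtraLen uint8
        DataType uint8
        VBucket  uint16
        BodyLen  uint32 // extras + key + value
        Opaque   uint32 // echoed back by the server
        CAS      uint64
    }

    // encode builds header + key + value, big-endian as the protocol requires.
    func encode(opcode uint8, key, value []byte) []byte {
        hdr := mcReqHeader{
            Magic:   0x80,
            Opcode:  opcode,
            KeyLen:  uint16(len(key)),
            BodyLen: uint32(len(key) + len(value)),
        }
        buf := &bytes.Buffer{}
        binary.Write(buf, binary.BigEndian, hdr)
        buf.Write(key)
        buf.Write(value)
        return buf.Bytes()
    }

    func main() {
        pkt := encode(0x00, []byte("dbname"), []byte(`{"some":"doc"}`))
        fmt.Printf("% x\n", pkt)
    }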

Today, I started a new project with a new guy, and got some pretty impressive internal demos working after a couple hours of work.

I get lots of things done fast and reliably. This doesn't happen as much for me in other languages. I went into more details in the follow-up post where I described how I built seriesly (and keep in mind, I wrote this after it had only been alive for two weeks): http://dustin.github.com/2012/09/13/inside-seriesly.html


Sorry, I didn't know your history. I was truly curious to know if this was a problem you decided to solve strictly so you could use Go, or if that was just your, pardon me, "go to" language. It appears the latter. I hope I'm not implying that you SHOULDN'T have used it; my query was truly just curiosity, since I'm curious about the language. I'm trying to get a feel for what other people are doing with it "for real".


This is a very cool project that can be very useful in certain specific use cases. One example would be when you are working with medium-sized time-series business data.

When the data set is small, an RDBMS or Mongo is fine. When the data set is big, the general advice is to go the Cassandra, Hadoop, HBase and friends route, or a similar NoSQL route, which is a bit of a pain to deal with.

Where I think this tool fits nicely is when you don't want to (or can't) invest time in the big NoSQL cannons, but your data and data processing are NoSQL-like in nature and big enough to be painful with existing solutions.

It would be really cool to have support for the following features [please keep in mind that I am a finance person, not a developer ;-)]:

1. Ability to query data in batch [in addition to live] mode. The mode should be transparently manageable by application code. Example: my dataset is too big, I am fine with delays, so I run data processing [aggregation, etc.] as a scheduled task during a nightly maintenance window.

2. Ability to define an aggregation schema that is saved in the database. This is very useful for the following use case: I know my typical aggregation pattern, so I define it in the database schema; I schedule a batch processing task, which aggregates data according to the already-defined schema and saves the results back into the database; when I need to query the data later on, I can use the pre-computed results instead of running a live query each time. This feature is very important from an ease-of-use perspective, because the database handles it instead of the application.

3. A transparent, easy way to manage availability: master-slave or master-master replication with automatic failover.

4. Sharding data automatically and/or easily across available nodes.

5. Some way to ensure that data is never lost.

I really hope that this project will get rolling and expand.

P.S.: shoot me a note if you need me to elaborate...


This is great feedback. You seem to get what I'm going for.

Thoughts on your specific items:

1. I could probably prioritize the query/doc processing and get most of this out of the way, or something like what I've been thinking about for #4.

2. I've thought about this one for sure. It's actually possible to do externally already, just not very magically (there's a rough sketch of the external approach after this list). I'll learn more when I get more internal people pushing it.

3. I've been tempted to add replication -- not because I need it, but because it's just really easy. Master-slave is completely trivial. Master-master isn't hard, but requires tracking a tiny bit of state that I don't have an easy way to do yet. It'd be worth it just for fun.

4. I have a lot of infrastructure for this. To be efficient, I need something like _all_docs that doesn't include the values, and/or something like get that evaluates a jsonpointer. Then you could pretty well round-robin your writes and have a front-end that does this last-step work: harvest the range from all nodes concurrently while collating the keys. Once you find a boundary from every node, you have a fully defined chunk and can start doing reductions on it. A slightly harder, but more efficient, integration is to have two-phase reduction and let the leaves do a bunch of the work while the central thing just does collation; you wouldn't be able to stream results in that scenario, though. (There's a toy sketch of the front-end idea after this list, too.)

5. Is this as simple as disabling DELETE and PUT (where a document doesn't exist)?
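
To make #2 concrete, the "external, not very magical" version is basically a scheduled job that runs the aggregation query and writes the result back as its own document, so later reads hit pre-computed data. This is a hand-wavy sketch only; don't take the URLs or parameter names literally:

    package main

    import (
        "bytes"
        "io"
        "log"
        "net/http"
        "time"
    )

    func main() {
        // Hand-wavy endpoints: raw samples live in one database,
        // nightly rollups get written into another.
        queryURL := "http://localhost:3133/metrics/_query?group=3600000&reducer=avg"
        rollupURL := "http://localhost:3133/metrics_hourly"

        for {
            if resp, err := http.Get(queryURL); err != nil {
                log.Printf("query failed: %v", err)
            } else {
                body, _ := io.ReadAll(resp.Body)
                resp.Body.Close()
                // Store the pre-computed aggregate as a new document.
                if presp, err := http.Post(rollupURL, "application/json", bytes.NewReader(body)); err != nil {
                    log.Printf("rollup write failed: %v", err)
                } else {
                    presp.Body.Close()
                }
            }
            time.Sleep(24 * time.Hour) // "nightly maintenance window"
        }
    }

And for #4, the front-end I have in mind is basically scatter/gather: fan the same query out to every node concurrently, collate the partial results by key, and reduce again in the middle. A toy sketch (again, node URLs and the response shape are just for illustration):

    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "sync"
    )

    func main() {
        nodes := []string{"http://node1:3133", "http://node2:3133"}
        query := "/metrics/_query?group=60000&reducer=count"

        var (
            mu     sync.Mutex
            wg     sync.WaitGroup
            merged = map[string][]json.RawMessage{} // key -> partials from each node
        )

        for _, n := range nodes {
            wg.Add(1)
            go func(base string) {
                defer wg.Done()
                resp, err := http.Get(base + query)
                if err != nil {
                    return
                }
                defer resp.Body.Close()
                var rows map[string]json.RawMessage
                if json.NewDecoder(resp.Body).Decode(&rows) != nil {
                    return
                }
                mu.Lock()
                for k, v := range rows {
                    merged[k] = append(merged[k], v) // collate; re-reduce here
                }
                mu.Unlock()
            }(n)
        }
        wg.Wait()
        fmt.Println(len(merged), "collated result rows")
    }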


Hi Dustin, I've sent you an e-mail to @spy.net to continue the discussion. Is this your working e-mail?


Nice. I did a time-series DB as well and just recently put it up on Google Code. I started out trying to wrap rrdtool and got sufficiently frustrated with its API that I wrote something that wraps sqlite3 instead. Works great and uses dygraphs to graph the feed.

http://code.google.com/p/gomonitor/


As a side question, what exactly is the benefit of a time-series database vs. a regular relational database? Is it just a special case of a relational database, where rows are stored based on time, so that time-based queries are more efficient? Or is there some other fundamental difference that makes it more useful?


I understood the need for a time-series server when I was developing a real-time DAU metric server.

There's a cool hack

http://blog.getspool.com/2011/11/29/fast-easy-realtime-metri...

But not as cool as a time series db.


The issue is that treating dates as keys makes it hard to model normal intuitions about time when constructing queries.

You might find this talk informative: http://www.cs.nyu.edu/shasha/papers/jagtalk.html


Thanks for sharing. I've been working with the same concept for the last few days. Cube (a MongoDB/Node.js REST server), http://square.github.com/cube/, really gave me a kickstart on simple data capture and querying. (I use Highcharts for the visualization.)


That is nifty. Does anyone know of any high-performance time-series-oriented databases (closed source or not) which support flexible/advanced querying (like being able to find patterns)?


TempoDB (http://tempo-db.com/) is a TechStars Cloud company that is doing a hosted time-series database. Really smart guys.

I'm building an open source behavioral database (time series over objects) called Sky (https://github.com/skydb/sky). It comes with an LLVM-backed language called Qip that is a mix between procedural & declarative and provides easy integration with C libs, so you could plug in a machine learning library (relatively) easily. It's fast too. It'll crunch through tens of millions of events per second on a single core.

I'm releasing the initial v0.1.0 at the end of the month. Shoot me a message on Twitter (@benbjohnson) if you want some more information.


Thank you for sharing your code! This is really cool stuff.

How important of a goal was it to follow/diverge from pp/sql's lead in designing Qip?


I'm not sure that I know what pp/sql is. As far as diverging from SQL goes, the data I want to analyze is really separate paths of actions performed by distinct objects (e.g. users). It's not relational or tabular data, so I ran into issues trying to use a language like SQL to query it.


It was pl/sql, I'm sorry. I got autocorrected. PL/SQL is a procedural language and a superset of SQL. It's what stored procedures etc. tend to be written in.

Thank you for your response!



What kinds of advanced things are you wanting to accomplish?


Finding trends, for example.


I love the name.


It's a shame this project chose to duplicate the name of an existing project/website:

https://github.com/stefanw/seriesly

http://www.seriesly.com/


There are two hard things in computer science. The next blog post I'm planning is related to cache management.


After that I suggest a blog post on the other hard thing in computer science, off-by-one errors.


> The next blog post I'm planning is related to cache management.

I think you mean cache invalidation. ;)


Looks like a good concept. I've wondered why there aren't better open-source time-series databases.


I think there are some good ones. I wrote this because I could get it running faster than I could get the data I've got adapted to existing ones. That doesn't mean they're bad as much as it means I don't understand the data I've got. :)

The way I like to think about document-oriented databases is that you store what you have when you have it, and worry about what it was later when you need to get things back out of it.

e.g. the big bag of stuff I mentioned in the blog post contains a few things I know I don't need, a few things I think I probably need, and a lot of stuff I just don't want to think about (I might need it later, maybe after some manipulation, etc...). Lob it all in.
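
In practice "lob it all in" is just a POST of whatever JSON you happen to have, and you sort out what it means at query time. Something like this (the URL and port are illustrative, not gospel):

    package main

    import (
        "log"
        "net/http"
        "strings"
    )

    func main() {
        // Whatever the app has lying around right now; figure out what it
        // means later, when you need to get it back out.
        blob := `{"temp": 21.3, "room": "bedroom", "junk_i_might_want": [1, 2, 3]}`

        resp, err := http.Post("http://localhost:3133/stats", "application/json",
            strings.NewReader(blob))
        if err != nil {
            log.Fatal(err)
        }
        resp.Body.Close()
    }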

The downside of a system like seriesly vs. a system like rrd (or any modern equivalent) is the same as the downside of any nosql database vs. a sql database. By planning up front, I can keep the size down and get more performance by incrementally computing stuff from the beginning. In the meantime, I'll just buy more disk. :)

I'm reasonably happy with the performance, though. There's a good number of visitors on the page right now and this is what they're seeing:

    2012/09/11 12:28:50 Completed query processing in 82.54ms, 6,266 keys, 1,280 chunks
That means that for that query, it scanned through 6,266 keys in the on-disk b-tree, grouped them into 1,280 separate result "rows" to be reduced and did the necessary computation to emit all of them in under a tenth of a second while lots of other queries were in flight. My "extreme" cases right now are taking under 3 seconds on over half a million keys. I consider that acceptable for two weeks of side-project.
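
If it helps to picture it, the grouping/reduction step is conceptually nothing more than this (a toy version, not the actual seriesly code): bucket timestamped keys into fixed windows, then reduce each window.

    package main

    import (
        "fmt"
        "time"
    )

    type sample struct {
        ts    time.Time
        value float64
    }

    func main() {
        window := time.Minute
        samples := []sample{
            {time.Unix(1347369600, 0), 1.0},
            {time.Unix(1347369630, 0), 3.0},
            {time.Unix(1347369700, 0), 5.0},
        }

        // Group: one "chunk" (result row) per time window.
        chunks := map[time.Time][]float64{}
        for _, s := range samples {
            b := s.ts.Truncate(window)
            chunks[b] = append(chunks[b], s.value)
        }

        // Reduce: here, the average of each chunk.
        for bucket, vals := range chunks {
            sum := 0.0
            for _, v := range vals {
                sum += v
            }
            fmt.Printf("%v avg=%.2f over %d keys\n",
                bucket.UTC(), sum/float64(len(vals)), len(vals))
        }
    }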


What are some of the others? The last time I looked, MonetDB and LucidDB seemed to be the most popular column-store open source projects, but they seem to have been mostly subsumed by proprietary products.


Whisper (backend for graphite), cube, ganglia... possibly more. I used to build things directly on top of rrdtool, but the schema definition can be a pain when you've got a lot of dynamic data.


We make a time-series database service: http://tempo-db.com. Although it's not open source, many people who've struggled with open source tools end up using us.


TempoDB is a great service. Andrew and his team are some smart guys.


Thanks Ben!


I don't know about a database but I use the zoo and xts packages in R quite heavily and they are pretty good.


There's http://opentsdb.net/. Most of the time, though, I think people just write a custom thing on top of Cassandra or an equivalent.



