This is a very cool project that could be very useful in certain specific use cases. One example would be working with medium-sized time-series business data.
When the data set is small, an RDBMS or Mongo is fine. When the data set is big, the general advice is to go the Cassandra/Hadoop/HBase route, or a similar NoSQL path, which is a bit of a pain to deal with.
Where I think this tool fits nicely is when you don't want (or have) to invest time in the big NoSQL cannons, but your data and processing are NoSQL-like in nature and big enough to be painful with existing solutions.
It would be really cool to have support for the following features [please keep in mind that I am a finance person, not a developer ;-)]:
1. Ability to query data in batch mode [in addition to live mode]. Which mode is used should be transparently manageable from application code. Example: my dataset is too big and I am fine with delays, so I run data processing [aggregation, etc.] as a scheduled task during a nightly maintenance window.
2. Ability to define an aggregation schema that is saved in the database. This is very useful for the following use case: I know my typical aggregation pattern, so I define it in the database schema, schedule a batch processing task that aggregates data according to that schema, and save the results back into the database. When I query the data later, I can use the pre-computed results instead of running a live query each time (see the sketch after this list). This is very important for ease of use, because the database handles the aggregation instead of the application.
3. A transparent, easy way to manage availability: master-slave or master-master replication with automatic failover.
4. Sharding data automatically, or at least easily, across the available nodes.
5. Some way to ensure that data is never lost.
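To make items 1 and 2 a bit more concrete, here is a minimal Go sketch (purely illustrative, with made-up data and no real database calls) of the kind of hourly rollup a nightly batch task could pre-compute and store, so that later queries read the stored aggregate instead of re-scanning the raw points:

```go
package main

import (
	"fmt"
	"time"
)

// Point is one raw time-series sample; in a real setup these would come
// back from a range query against the database rather than a literal slice.
type Point struct {
	TS    time.Time
	Value float64
}

func main() {
	raw := []Point{
		{time.Date(2013, 1, 1, 0, 5, 0, 0, time.UTC), 10},
		{time.Date(2013, 1, 1, 0, 40, 0, 0, time.UTC), 14},
		{time.Date(2013, 1, 1, 1, 10, 0, 0, time.UTC), 20},
	}

	// Roll the raw points up into hourly averages -- the kind of
	// pre-computed aggregate a nightly batch job could write back
	// under its own keys for later queries to read directly.
	sums := map[time.Time]float64{}
	counts := map[time.Time]int{}
	for _, p := range raw {
		bucket := p.TS.Truncate(time.Hour)
		sums[bucket] += p.Value
		counts[bucket]++
	}

	for bucket, sum := range sums {
		avg := sum / float64(counts[bucket])
		// A real job would store this result back into the database
		// under a rollup key; printing stands in for that write here.
		fmt.Printf("%s avg=%.1f (n=%d)\n", bucket.Format(time.RFC3339), avg, counts[bucket])
	}
}
```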
I really hope that this project will get rolling and expand.
P.S.: shoot me a note if you need me to elaborate...
This is great feedback. You seem to get what I'm going for.
Thoughts on your specific items:
1. I could probably prioritize the query/doc processing and get most of this out of the way, or do something like what I've been thinking about for #4.
2. I've thought about this one for sure. It's actually possible to do externally already, just not very magically. I'll learn more when I get more internal people pushing it.
3. I've been tempted to add replication -- not because I need it, but because it's just really easy. Master-slave is completely trivial. Master-master isn't hard, but requires tracking a tiny bit of state that I don't have an easy way to store yet. It'd be worth it just for fun.
4. I have a lot of infrastructure for this. To be efficient, I need something like _all_docs that doesn't include the values, and/or something like get that evaluates jsonpointer. Then you could pretty well round-robin your writes and have a front-end that does the last-step work: harvest the range from all nodes concurrently while collating the keys. Once you've seen a boundary from every node, you have a fully defined chunk and can start doing reductions on it (see the sketch after this list). A slightly harder, but more efficient, integration is a two-phase reduction that lets the leaves do a bunch of the work while the central node just does collation. You wouldn't be able to stream results in that scenario, though.
5. Is this as simple as disabling DELETE and PUT (where a document doesn't exist)?
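Roughly what the collation step in item 4 could look like, as a minimal Go sketch. fetchKeys is a hypothetical stand-in for the keys-only _all_docs call described above, and the node names and timestamps are made up; a real front-end would fetch from each node concurrently.

```go
package main

import (
	"fmt"
	"sort"
)

// fetchKeys stands in for a hypothetical keys-only _all_docs call against
// one shard; here it just returns a canned, sorted slice of timestamps.
func fetchKeys(node string) []string {
	data := map[string][]string{
		"node-a": {"2013-01-01T00:00:01Z", "2013-01-01T00:00:04Z", "2013-01-01T00:00:09Z"},
		"node-b": {"2013-01-01T00:00:02Z", "2013-01-01T00:00:05Z", "2013-01-01T00:00:07Z"},
		"node-c": {"2013-01-01T00:00:03Z", "2013-01-01T00:00:06Z", "2013-01-01T00:00:08Z"},
	}
	return data[node]
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}

	// Harvest keys from every node (concurrently in a real front-end),
	// remembering the highest key each node has reported so far.
	byNode := make(map[string][]string)
	var lastSeen []string
	for _, n := range nodes {
		keys := fetchKeys(n)
		byNode[n] = keys
		lastSeen = append(lastSeen, keys[len(keys)-1])
	}

	// The chunk boundary is the smallest "latest key" across nodes: every
	// node has reported everything up to that point, so the chunk at or
	// below it is fully defined and safe to reduce.
	sort.Strings(lastSeen)
	boundary := lastSeen[0]

	// Collate: merge all keys at or below the boundary into one sorted
	// chunk that a reducer can work on.
	var chunk []string
	for _, keys := range byNode {
		for _, k := range keys {
			if k <= boundary {
				chunk = append(chunk, k)
			}
		}
	}
	sort.Strings(chunk)

	fmt.Println("boundary:", boundary)
	fmt.Println("chunk to reduce:", chunk)
}
```

The idea is that the smallest "latest key" across nodes is a safe boundary: every node has reported everything up to it, so the chunk below it can be reduced (or handed to the leaves in the two-phase variant) without waiting for the rest of the range.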