
BigQuery PM here. I'd love to genuinely understand why you have that impression.

BigQuery's on-demand model charges you EXACTLY for what you consume. Meaning, your resource efficiency is 100% [0].

By contrast, typical "cluster pricing" technologies require you to pay for 100% of your cluster uptime. In private data centers, it's difficult to get above 30% average efficiency.
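
As a rough sketch of what that efficiency gap means in dollars (illustrative numbers only - the 30% figure is from above, the rest are placeholders):

    # Effective cost per hour of useful work on a flat-priced cluster
    # (all dollar figures are placeholders, not real list prices).
    cluster_monthly_cost = 9000.0   # hypothetical flat monthly bill
    utilization = 0.30              # fraction of paid time doing real work

    hours_paid = 24 * 30
    hours_used = hours_paid * utilization

    # Idle time is still billed, so the useful-work price is inflated
    # by a factor of 1/utilization (~3.3x at 30%).
    print(f"list price:      ${cluster_monthly_cost / hours_paid:.2f}/hr")
    print(f"effective price: ${cluster_monthly_cost / hours_used:.2f}/hr")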

BigQuery also takes care of all software, security, and hardware maintenance, including reprocessing data in our storage system for maximum performance and scaling your BigQuery "cluster" for you [1].

BigQuery has a perpetual free tier of 10GB of data stored and 1TB of data processed per month.

Finally, BigQuery is the only technology we're aware of whose logical storage system doesn't charge you for loads - loads neither consume your query capacity nor show up on your bill.

[0] https://cloud.google.com/blog/big-data/2016/02/visualizing-t...

[1] https://cloud.google.com/blog/big-data/2016/08/google-bigque...




The answer comes down to one line in your BigQuery Under the Hood article[0]: "The answer is very simple — BigQuery has this much hardware (and much much more) available to devote to your queries for seconds at a time. BigQuery is powered by multiple data centers, each with hundreds of thousands of cores, dozens of petabytes in storage capacity, and terabytes in networking bandwidth. The numbers above — 300 disks, 3000 cores, and 300 Gigabits of switching capacity — are small. Google can give you that much horsepower for 30 seconds because it has orders of magnitude more."

Most companies are fine letting their warehouse go underutilized, or letting queries run without such enormous resources, if it means capping their monthly data warehouse bill at a fixed number, say $9,000.

BigQuery is an awesome piece of technology, but most publishers, ecommerce, and SaaS companies have teams of analysts, engineers, and business folks pounding away at their warehouse all day. And it's fine if those queries aren't as fast as BigQuery's.

I run an analytics company and we load billions of events into our customers' warehouses each day. Many have evaluated BigQuery and they all came back with the same answer: too expensive. Most of them are big companies, but they're spending nowhere near the $40K they'd have to spend to cap their cost on BigQuery. And with the advent of Spectrum, they're even less likely to jump ship now.

Since you're a PM, I'd be really interested to know if you guys are aware of this issue and if you're doing anything to offer a solution that competes with Redshift (fixed cost/resource). I ask this as someone who runs a ton of stuff on GCP, but we've just never found a way to make BigQuery cost effective for us.

[0] https://cloud.google.com/blog/big-data/2016/01/bigquery-unde...


Genuinely, thanks for the info - this is super insightful.

I'll argue that BigQuery's per-query pricing charges you JUST for the resources you use (well, data scanned), so it SHOULD be far less expensive than a model that charges you for the luxury of having a cluster sit idle (and often at only 30% utilization), correct?

Can you help me unpack this further? I think pay-per-query is ultra-efficient, but difficult to predict. However, buy-a-cluster is easy to predict but inefficient. Do you think that difficulty in planning for BigQuery spend translates into a perception of it being too expensive (and potentially unbounded spend?), or do you think BigQuery's pay-per-query is indeed too expensive?

Most existing BigQuery customers, even at large scale, do have the option to go from pay-per-query to flat rate and back, and choose to stay on on-demand because... well... it's much, much more efficient :)

For example, if Netflix charged you a penny per minute of watchtime, you'd have no idea if it's more expensive or less expensive, but you'd be assured it's more efficient.

Feel free to ping me offline as well.


For background, our customers use Redshift, and we are also a heavy Redshift user for our own back-end. The ones that have done a bake-off with BigQuery (and Snowflake too, btw) all come back with the conclusion "too expensive". And if you're the BigQuery PM, I'd love to talk to you, because we're planning to build for BigQuery what we've built for Amazon Redshift. lars at intermix dot io

I think the difference here is in the use case, and I think it boils down to "ETL" vs. "ELT". If you do ETL, and your analyst team runs a few ad-hoc queries, then the pay-per-query approach makes a lot of sense. I would agree that it's "ultra-efficient".

If you do "ELT", where you ingest fairly raw data on a continuous basis, and then run complex transformations within your data warehouse, then the "buy a cluster" pricing will win. The data loads (we've seen load frequency up to every 2 minutes) require resources, and then so do the transformations.

===================

side note:

When it comes to utilization, I always have to think of this great research paper by Andrew Odlyzko:

http://www.dtc.umn.edu/~odlyzko/doc/network.utilization.pdf

"Data networks are lightly utilized, and will stay that way"

===============================

Agreed that most warehouses sit at 30% utilization. The customers we work with run at higher utilization than 30%, though. That's because they're continuously ingesting data, transforming it within Redshift, and then running lots of ad-hoc queries from analyst teams and data services that feed other applications. If your business runs on two or more continents, you don't even have downtime during US nighttime, as Europe, Asia, etc. are running. And so in those cases, we've seen that customers don't want any surprises and would rather pay for the cluster. Predictability becomes more important than paying the lowest amount possible per query.

---------

I'm a co-founder at https://www.intermix.io - we provide monitoring for data infrastructure


> ...BigQuery's per-query pricing charges you JUST for the resources you use (well, data scanned), so it SHOULD be far less expensive than a model that charges you for the luxury of having a cluster sit idle (and often at only 30% utilization), correct?

> For example, if Netflix charged you a penny per minute of watchtime, you'd have no idea if it's more expensive or less expensive, but you'd be assured it's more efficient.

The Netflix analogy is actually quite good. Using that, let's say I have a family of six, I'm billed $0.01 per minute of viewership, and I'm getting the best quality and speed. Each person watches 40 minutes a day, all 30 days, for a grand total of $72 - far greater than the fixed cost of $10.99 (with, say, significantly less quality, speed, and minutes per month of viewership).
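
Spelling out the arithmetic behind those numbers:

    # Metered vs. flat pricing for the family of six above.
    people, minutes_per_day, days = 6, 40, 30
    price_per_minute = 0.01   # the metered rate

    metered_bill = people * minutes_per_day * days * price_per_minute
    print(f"metered: ${metered_bill:.2f}")   # $72.00
    print(f"flat:    $10.99")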

In the real world, the family of six is my team of data analysts, scientists, and BI folks who are querying my database from 9-5 every weekday.

A customer of mine, a large NYC publisher who you have heard of and probably read, evaluated BigQuery in 2017. They loaded BQ with the exact same data as their Redshift cluster, pointed their Looker instance at it, and in just one day blew through 1/4 of their typical Redshift budget. All the queries were faster. Way faster than they needed to be, actually.

Going back to my original post above, the issue here is that just because Google CAN throw these massive amounts of resources at my problem doesn't mean I can afford to use that level of computation for each query, or would even want to. In the Netflix example, I'm happy if Sally and Billy get lower quality or limited time watching Netflix, as long as my bill stays at $10.99.

For most companies, their Redshift cluster is optimized to be able to handle their peak workload WELL ENOUGH. That means that queries won't be as fast as BigQuery - and that's totally fine. And it means that the cluster will be underutilized for large portions of the night and weekends - again, totally fine. They just need their usage capped at a predetermined cost, with queries finishing in a reasonable amount of time.

I've posed this to Google employees before and I'm met with "well, you can limit how much each person can query a day." Except that isn't an acceptable solution. I can't have analysts sitting around unable to query their database because they've exceeded their daily limit. They'd rather just fire off their Redshift query, and if it takes a little longer, so be it.


To repeat my other comments, the issue is that BQ doesn't actually charge for compute but for data scanned. You already pay for storage (billed uncompressed, but that's a separate issue), so paying again on that metric doesn't make sense for querying: queries consume CPU cores, which is a time-based resource. If I use 1,000 cores for 10 seconds vs. 10 minutes, it makes sense to pay for exactly that, regardless of how much data was scanned.
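
To make the contrast concrete, here's a sketch of the two billing formulas (the core-hour rate is a made-up placeholder; the $5/TB scan rate matches BigQuery's on-demand list price):

    # Same hypothetical query under time-based vs. scan-based billing.
    CORES = 1000
    CORE_HOUR_RATE = 0.05      # $/core-hour, placeholder
    SCAN_RATE_PER_TB = 5.0     # $/TB scanned
    TB_SCANNED = 10.0

    def time_based_cost(seconds):
        return CORES * (seconds / 3600) * CORE_HOUR_RATE

    # Scan-based billing charges the same whether the query takes
    # 10 seconds or 10 minutes; time-based billing does not.
    print(f"scan-based:        ${SCAN_RATE_PER_TB * TB_SCANNED:.2f}")
    print(f"time-based, 10s:   ${time_based_cost(10):.2f}")
    print(f"time-based, 10min: ${time_based_cost(600):.2f}")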

We've run into the same issue, where being curious with BigQuery actually becomes problematic: users are perfectly fine waiting an extra minute to scan 10 TB, but they're afraid of the $50 bill that comes with it - especially for every little query that might be mistyped, or that's out of their hands when BI tools run queries of their own.
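
One partial mitigation (it doesn't change the pricing model, but it removes the surprise) is BigQuery's dry-run mode, which reports the bytes a query would scan without actually running it. A minimal sketch with the official Python client, using a placeholder table name:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Dry run: validates the query and reports bytes scanned; costs nothing.
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(
        "SELECT * FROM `my_project.my_dataset.events`",  # placeholder table
        job_config=job_config,
    )

    tb = job.total_bytes_processed / 1e12
    print(f"would scan {tb:.2f} TB, ~${tb * 5:.2f} at $5/TB")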


The Netflix analogy is great, and I would add: being charged $0.01 per minute of viewership is stressful enough that you'd consume less, or you'd worry about what your parents would say about the bill.


Thanks, this is very valuable feedback - worth its weight in gold. Please feel free to email me directly to unpack our future plans here.

The Netflix analogy also applies if they bill $0.001 per minute - the number I chose isn't indicative of real pricing; I was just making a point. My goal was to demonstrate that it's more efficient per resource.


Isn’t “buy 2000 BQ slots for 40k a month” essentially just buying a cluster?


Yes, essentially... a software-defined cluster (not a physical one). It's a parallel pricing model. To be clear, my previous post's logic doesn't make an exception for this model - you trade efficiency for predictability.


BigQuery charges you for storage, so it's not just the queries. It also charges for streaming inserts, which are also not that cheap for large volumes of data.

I might be biased since I'm running an analytics company and this is my core business but here's our alternative:

The query and storage layers should be separated, because the query layer is often the expensive one. We have API servers which ingest the data from SDKs, enrich/sanitize it, and send it to a commit log such as Kinesis or Kafka. One API server is capable of handling 10k r/q.

We have Kinesis/Kafka consumers that consume the data in micro-batches, convert it to ORC files, store them on S3, and index the metadata in a single MySQL server. The throughput is around 15k r/q.

S3 or Cloud Storage is cheap, reliable, and can be used for many other use cases as well, but you need high-memory nodes for the query layer if you're dealing with large volumes of data.

We have a Presto cluster which spins up when you need to run ad-hoc queries and lets you pre-materialize the data so that you can power your dashboards. If you're running expensive queries on raw data, you can spin up 10 Presto workers, execute the queries, and then just shut them down when you don't need them.

That way we're able to access all the historical data, with no extra cost for data ingestion and storage, and it's even better than serverless since it's much cheaper. The scaling can be automated based on the hour of the day (your analysts' working hours) or on CPU and memory load, so unfortunately BigQuery's on-demand model is not a killer feature anymore.
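
For a sense of what the ad-hoc path looks like in practice, here's a minimal sketch using the presto-python-client package (host, catalog, schema, and table names are all placeholders):

    import prestodb  # pip install presto-python-client

    # Connect to a transient Presto coordinator fronting ORC files on S3.
    conn = prestodb.dbapi.connect(
        host="presto-coordinator.internal",  # placeholder host
        port=8080,
        user="analyst",
        catalog="hive",    # Hive connector over the S3-backed ORC tables
        schema="events",
    )

    cur = conn.cursor()
    cur.execute("SELECT event_type, count(*) FROM raw_events GROUP BY 1")
    for event_type, cnt in cur.fetchall():
        print(event_type, cnt)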


Thanks, can we continue this discussion? Super interesting to me.

One clarification. BigQuery has two methods of ingest. The Streaming API you mention does carry additional cost, but for the added benefit of having your data appear in BigQuery in real-time.

Batch load, however, is entirely free - as in, it doesn't use your "query capacity", and we don't charge for it. I am rather certain this is a very compelling offering. Batch loads encode your data, replicate it, secure it, convert it into our format, and take care of the subtle things that otherwise slow you down or cause headaches: optimal file sizes, optimal number of files, optimal file groupings for your queries, and so on. We also manage all the metadata for you (your MySQL instance), and we burn a good chunk of resources post-load re-materializing and optimizing your dataset. We also handle upgrades, downtime, and so on.
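
For reference, a batch load is a single call with the Python client - a minimal sketch, with placeholder bucket and table names:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Batch load from Cloud Storage: free, and doesn't consume query capacity.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # infer the schema from the files
    )
    load_job = client.load_table_from_uri(
        "gs://my-bucket/events/*.json",   # placeholder URI
        "my_project.my_dataset.events",   # placeholder table
        job_config=job_config,
    )
    load_job.result()  # block until the load completes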

It sounds like your use case is micro-batch, so perhaps you could benefit from our free batch loads?

It also sounds like you don't mind the extra operational overhead, and would rather operate your own stack, run your own upgrades, fix your own file issues, and so on. This is a personal preference, I agree. Customers who prefer our model love the ease of use and would rather focus their energy elsewhere. It is a stated goal of BigQuery to abstract away complexity and make BigQuery as easy to use as possible for everyone, and if we're failing somewhere, we certainly want to know :)

Finally, if I had to characterize BigQuery's on-demand pricing: it gives you the ability to go from 0 cores to thousands and back to 0 in sub-second intervals, scoped down to individual query size. This is exactly what's happening under the hood, but it's all abstracted away behind a "run query" button.


Our system is also near-real-time. The default batch duration is 3 minutes. It's configurable, but it also affects the background processes (shard compaction, data-organization jobs, etc.), so our customers usually prefer 5 minutes, which is still near-real-time at 10B events per month.

Please note that this is our core business. Of course we maintain these services, but we also try hard to push more work to the cloud providers. For example, both AWS and GCP offer managed MySQL servers, autoscaling groups for nodes, and object stores such as S3 and Cloud Storage, so in practice we maintain the software and help customers upgrade it when they need to.

I agree that BigQuery will save time if the users are not familiar with distributed systems and big data, but again, the "ability to go from 0 cores to thousands" doesn't mean much to me because in practice I have never experienced such a case. Since we need to start the instances, it may take up to 2-3 minutes, but this is often acceptable for data analysts.

Could you please elaborate on the part about "re-materializing and optimizing your dataset"? We do a number of optimizations such as compacting ORC files, bucketing, etc., but I would love to hear how BigQuery does post-processing.

As a note, I usually tend to simplify things, but most of the BigQuery customers I see end up overengineering because of cost optimization. For example, in the article the author uses Redshift for the dashboard data; the solution they use for moving data from BigQuery to Redshift also needs to be maintained, and it's not that easy. If I'm going to adapt my whole system to the way BigQuery works and push hard to save costs, then I expect it to be pretty cheap, but $40K for reserved slots doesn't sound cheap to me. We maintain similar-sized clusters for 20% of that price at the same data volume, and it's much more flexible.


Have you looked at Snowflake Data? They have a similar pay-per-compute setup to your Presto cluster, with automatic pause/resume to avoid paying for idle time. A good middle ground between traditional Redshift and flexible BigQuery.


Yes, they have their own query engine, but the idea is the same: separate the storage and query layers, scale the query layer on demand, and charge for it. I like their approach, but not everyone is willing to depend on a third-party company for their company data.


How is that different from BigQuery then? GCP and AWS are also third-party companies.


The motivation is rather different. If you ask Redshift users, the motivation is mostly that they're already using AWS and depend on its services. That's the case for GCP as well - I would love to hear the Google side of the story - but AWS customers don't usually use BigQuery for their analytics stack because, as the author explained, it's not that easy. That's why BigQuery isn't getting much traction from the outside, even though it's a cool technology.

We all agree that Redshift stops scaling at some point and that people switch to other technologies when they need to, but that's not the point. Creating a Redshift cluster is dead easy, and getting the data into it is also not complex compared to other solutions if you're already an AWS user.

The case for Snowflake is different: they also use AWS, but they manage your services. They made a smart move by storing the data in their customers' S3 buckets to make them feel like they "own" the data, but that's only possible because AWS lets them do it that way.

I believe that AWS doesn't try to make Redshift cost-efficient for large volumes of data because they already have the advantage of vendor lock-in and make money from their large enterprise customers processing billions of data points in Redshift. That's why there are so many Redshift monitoring startups out there to save the cost for you.

On the other hand, AWS is smart enough to build a Snowflake-like solution once their Redshift users start switching to BigQuery and Snowflake, and companies such as Snowflake need to be prepared for when that day comes. The cloud is killing everyone, including us.


Snowflake has their own storage unless you opt for the enterprise version to have it stored in your own account. They also use their own table format (like BigQuery) but you can always export it.

It's true that AWS has Redshift Spectrum (and Athena) to help with more scalable querying across S3; however, I don't think that poses a big risk to another company providing a focused offering on top. Snowflake is very well capitalized, with close to $500M in investment and plenty of customers, so I wouldn't worry about them going out of business.

I especially like them because they have the elastic computing style of BigQuery but charge for computing time rather than data scanned, which is a much more effective billing model than anything else out there.


What's the focused offering they provide? Is their system more scalable, efficient, or cheaper, or are there technical limitations in the AWS products? Personally, I don't think the $500M matters if their customers (I'm assuming most of their revenue comes from enterprise customers, because that's how it usually works) churn in a few years.


A more focused data warehouse product that is priced on a better model and is more scalable and efficient and cheaper than any of the existing options. $500M means they'll be around longer than most of the startup customers that try them.

Anyway, at this point I'm just repeating myself so I suggest you actually try them if you care for a better model. It works for us at 600B rows of data that was too expensive with BigQuery and too slow and complicated with Redshift/Spectrum.


Sounds fair. I'm convinced to try out their service; it's good to see that they've also added pricing to their website. Let's see how it compares to Presto, as they serve a similar use case.


Around query pricing, my understanding is BigQuery charges by bytes scanned uncompressed. Redshift Spectrum/Athena charges by bytes scanned compressed. That makes Athena/Spectrum cheaper as well.


That's not strictly true - it depends on the rate. The only thing you can say for sure is that analyzing uncompressed data on Athena is too expensive.


They have the same rate: BigQuery and Athena are both $5/TB scanned. Unless you store everything in raw JSON text, Athena will be cheaper even with the most basic compression.
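
To put numbers on that (the 5:1 compression ratio is a placeholder; real ratios depend on the data and format):

    # Same logical dataset, same $5/TB rate, different billed bytes.
    RATE_PER_TB = 5.0
    logical_tb = 10.0                  # uncompressed size (what BigQuery bills on)
    compressed_tb = logical_tb / 5.0   # assumed 5:1 compression

    print(f"BigQuery (uncompressed): ${logical_tb * RATE_PER_TB:.2f}")     # $50.00
    print(f"Athena (compressed):     ${compressed_tb * RATE_PER_TB:.2f}")  # $10.00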


I think that's the issue: sure, some companies will overprovision, but as you get closer to 100% utilization, BigQuery quickly becomes extremely expensive compared to the other options, unless you can afford the $40K/month fee for flat-rate.

Also, Snowflake Data is another option that supports automatically provisioning and pausing resources, which is even easier to manage than Redshift. Changing BigQuery pricing to bill on compressed data stored and scanned would go a long way toward making it more attractive for full-time usage.


> In private data centers, it's difficult to get above 30% average efficiency.

What exactly does that mean?


Typically servers are only under a reasonable workload for about 30% of the day, leaving that hardware unused (wasted) the other 70%.



