I wonder how much their setup costs. Naively, if one were to simply feed 100 PB into Google BigQuery without any further engineering efforts, it would cost about 3 million USD per month.
For indexing, they need 2800 vCPUs[1], and they are using c6g instances; on-demand hourly price is $0.034/h per vCPU.
So indexing will cost them around $70k/month.
For search, they need 1,200 vCPUs, which will cost them around $30k/month.
For storage, it will cost them $23/TB/month * 20,000 TB = $460k/month.
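For reference, a minimal back-of-the-envelope sketch of those figures, assuming roughly 730 hours per month and the on-demand per-vCPU price quoted above:

```python
# Rough AWS cost estimate from the numbers in this comment (on-demand retail prices).
HOURS_PER_MONTH = 730
VCPU_HOUR = 0.034          # c6g on-demand, per vCPU-hour

indexing = 2800 * VCPU_HOUR * HOURS_PER_MONTH   # ~$69.5k/month
search   = 1200 * VCPU_HOUR * HOURS_PER_MONTH   # ~$29.8k/month

# 100 PB of raw logs at ~5:1 compression -> ~20 PB = 20,000 TB on object storage
storage  = 23 * 20_000                          # ~$460k/month

print(f"indexing: ${indexing:,.0f}/month")
print(f"search:   ${search:,.0f}/month")
print(f"storage:  ${storage:,.0f}/month")
print(f"total:    ${indexing + search + storage:,.0f}/month")
```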
Storage costs are an issue. Of course, they pay less than $23/TB, but it's still expensive. They are optimizing this either by using different storage classes or by moving data to cheaper cloud providers for long-term storage (fewer requests mean you need less performant storage, and you can usually get a very good price on those object storages).
On the Quickwit side, we will also improve the compression ratio to reduce the storage footprint.
[1]: I fixed the number of indexing vCPUs; it was written as 4,000 when I published the post, but that figure corresponded to the total number of vCPUs for search and indexing.
At this level they can just go bare metal or colo. Use Hetzner's pricing as a reference. Logs don't need the same level of durability as user data; some level of failure is perfectly fine. I would estimate $100k per month or less, $200k at most.
You can get a spinning disk of 18TB (no need for SSD if you can write in parallel) for 224€. Let's round that to $300 for easy calculations.
To store 100 petabytes of data by purchasing disks yourself, you would need approximately 5556 18TB hard drives totaling $1,666,800.
Of course, you'll pay for more than just the disks.
Let's add the cost of 93 enclosures at $3,000 each ($279,000), and account for controllers and network equipment ($100,000), plus power and cooling infrastructure ($50,000, although it's probably already cool where they will host the thing). That comes to about $2.1M.
That's the total, and it's for the uncompressed data.
You would need 3 times that for redundancy, but it would still be 40% cheaper over 5 years, not to mention that I used retail prices. With their purchasing power they can get a big discount.
Now, you do have the cost of a team to maintain the whole thing, but they likely have their own data center anyway if they go that route.
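A quick sketch of the hardware math above (retail prices, uncompressed data, decimal TB for simplicity):

```python
# Rough bare-metal hardware estimate for 100 PB, using the retail prices above.
raw_tb = 100 * 1000                        # 100 PB in TB (decimal)
drive_tb, drive_price = 18, 300

drives = -(-raw_tb // drive_tb)            # ceiling division -> 5,556 drives
drive_cost = drives * drive_price          # ~$1.67M

enclosures    = 93 * 3_000                 # ~$279k
network       = 100_000
power_cooling = 50_000

single_copy = drive_cost + enclosures + network + power_cooling   # ~$2.1M
with_redundancy = 3 * single_copy                                 # ~$6.3M for 3 copies

print(f"drives: {drives}")
print(f"single copy:    ${single_copy:,}")
print(f"3x redundancy:  ${with_redundancy:,}")
```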
> disk of 18TB (no need for SSD if you can write in parallel)
Do note that you can put, at most, maybe 1TB of hot/warm data on this 18TB drive.
Imagine you do a query and 100GB of the data to be searched is on 1 HDD. At roughly 100-200 MB/s of sequential throughput, you will wait 500-1000s just for this hard drive. Now imagine slightly higher concurrency on this HDD, like 3 or 5 queries at once.
You can't fill these drives full with hot or warm data.
> To store 100 petabytes of data by purchasing disks yourself, you would need approximately 5556 18TB hard drives totaling $1,666,800.
You want 1,000x more drives and to fill each of them to only 1/1000 of its capacity. Now you can do parallel reads!
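To make the latency point concrete, a tiny sketch assuming ~150 MB/s of sequential throughput per spindle (an assumption, not a measured number):

```python
# Why you can't fill an 18 TB spinning disk with hot data: a single HDD does
# roughly 100-200 MB/s sequential, and far less under concurrent access.
hdd_mb_per_s = 150            # assumed mid-range sequential throughput
query_gb_on_one_disk = 100

seconds = query_gb_on_one_disk * 1000 / hdd_mb_per_s
print(f"one query, one disk: ~{seconds:.0f}s")          # ~667s

# To finish the same scan in, say, 10 seconds, the data has to be spread
# across many spindles (or live on SSD/object storage with high parallelism).
target_s = 10
spindles_needed = query_gb_on_one_disk * 1000 / (hdd_mb_per_s * target_s)
print(f"disks needed for a {target_s}s scan: ~{spindles_needed:.0f}")  # ~67
```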
For this purpose you would likely not buy ordinary consumer disks but rather bulletproof enterprise HDDs. Otherwise a significant number of the 5,556 disks would not survive the first year, assuming they are under constant load.
Quickwit's big advantage is that you can point it at anything that speaks S3 and it will be happy. So ideally you delegate the whole storage story: hire someone who knows their way around Ceph (erasure coding, load distribution) and call a few DC/colo/hosting providers (for the initial setup and the regular HW replacements).
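As an illustration of why the erasure-coding expertise pays off at this scale, a rough comparison of raw capacity under 3x replication vs. an 8+3 erasure-coding profile (both are example numbers, not anyone's actual setup):

```python
# Raw capacity needed for ~20 PB of compressed logs under two durability schemes.
logical_pb = 20

replication_factor = 3
raw_replicated = logical_pb * replication_factor   # 60 PB raw

k, m = 8, 3                      # e.g. EC 8+3: survives the loss of 3 chunks
ec_overhead = (k + m) / k        # 1.375x
raw_ec = logical_pb * ec_overhead                  # 27.5 PB raw

print(f"3x replication: {raw_replicated} PB raw")
print(f"EC {k}+{m}:        {raw_ec} PB raw")
```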
Good question. I thought it would be a no-brainer to put it on S3 or similar, but that's already way too expensive at $2M/month before API requests.
Backblaze storage pods are an initial investment of about $5 million; that's probably the best bet, and at that level of savings, having 1-3 good people dedicated to this is probably still cheaper.
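At those numbers the break-even is almost immediate; a very rough sketch, where the hardware figure, cloud bill, and salaries are all just the assumptions floated in this thread:

```python
# Very rough break-even: self-built storage pods vs. a cloud object-storage bill.
upfront_hw    = 5_000_000       # assumed "storage pods" initial investment
monthly_cloud = 2_000_000       # assumed S3-like bill, before API requests

team_size, salary = 3, 200_000  # hypothetical dedicated ops team, annual salary
monthly_ops = team_size * salary / 12

months_to_break_even = upfront_hw / (monthly_cloud - monthly_ops)
print(f"break-even after ~{months_to_break_even:.1f} months")   # ~2.6 months
```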
But you could/should start talking to the big cloud providers to see whether they are flexible enough to go lower on the price.
I have seen enough companies, including big ones, be absolutely shitty at optimizing these types of things. At this level of data, I would optimize everything, including encoding, date formats, etc.
But as I said in my other comment: the interesting questions are not answered :D
Indeed. They benefit from a discount, but we don't know the discount figure.
To further reduce storage costs, you can use S3 storage classes or a cheaper object storage provider like Alibaba for longer retention. Quickwit does not handle that, though, so you need to manage it yourself.
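On AWS, for example, that kind of tiering is just a bucket lifecycle rule managed outside Quickwit. A minimal boto3 sketch, where the bucket name and prefix are placeholders; note that classes without instant retrieval (Glacier, Deep Archive) would be too slow to search interactively:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/prefix holding the index data; adjust to your own layout.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-quickwit-indexes",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-old-splits",
                "Filter": {"Prefix": "indexes/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},  # still instant retrieval
                ],
                "Expiration": {"Days": 365},  # drop data past the retention window
            }
        ]
    },
)
```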
Logs should compress better than that, though, right? 5:1 compression is only about half as good as you'd expect even naive gzipped json to achieve, and even that is an order of magnitude worse than the state of the art for logs[1]. What's the story there?
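For a sense of scale, even naive gzip on repetitive JSON log lines tends to land well beyond 5:1. A quick self-contained experiment with synthetic logs; the field names and value distribution are made up, so treat the exact ratios as illustrative only:

```python
import gzip, json, lzma, random

# Generate synthetic, repetitive JSON log lines (real logs are usually even
# more repetitive, so these ratios are if anything on the low side).
random.seed(0)
lines = []
for i in range(50_000):
    lines.append(json.dumps({
        "ts": 1_700_000_000 + i,
        "level": random.choice(["INFO", "WARN", "ERROR"]),
        "service": random.choice(["api", "ingest", "search"]),
        "msg": f"request handled in {random.randint(1, 500)} ms",
        "status": random.choice([200, 200, 200, 404, 500]),
    }))
raw = ("\n".join(lines)).encode()

for name, compress in [("gzip", gzip.compress), ("xz", lzma.compress)]:
    out = compress(raw)
    print(f"{name}: {len(raw) / len(out):.1f}:1")
```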
"Object storage as the primary storage: All indexed data remains on object storage, removing the need for provisioning and managing storage on the cluster side."
So the underlying storage is still object storage; base your calculations on whichever backend you are using: S3, GCP Object Storage, self-hosted Ceph, MinIO, Garage, or SeaweedFS.
Yeah, using your preferred cloud data warehouse with an indexing layer seems fine for this sort of thing. Compared to something specialized like this, it has the advantage that you can still easily do stream processing / Spark / etc., plus it probably saves some money.
Maybe Quickwit is that indexing layer in this case? I haven't dug too much into the general state of cloud dw indexing.
Quickwit is designed to do full-text search efficiently with an index stored on object storage.
There is no equivalent technology, apart maybe from:
- Chaossearch, but it is hard to tell because they are not open source and do not share their internals (if someone from Chaossearch wants to comment?).
- Elasticsearch makes it possible to search an index archived on S3. This is still a super useful feature as a way to occasionally search your archived data, but it would be too slow and too expensive (it generates a lot of GET requests) to use as your everyday "main" log search index.