
Sure. Here are a few examples. They're all based on my experience with the MongoDB API for CosmosDB. Your mileage with other APIs may vary.

1. CosmosDB has a hardcoded 60-second timeout for queries, which means any query that takes longer is literally impossible to run without breaking it into smaller chunks. This is worse than it sounds, because CosmosDB lacks some of the basic optimizations that exist in other databases. For example, finding all distinct values of an indexed field required a full scan, which wasn't doable in 60 seconds. Another example: deleting all documents with a specific value in an indexed field - again, not doable in 60 seconds. When deleting or updating multiple documents, we'd write short snippets of code that queried for the ids of all documents that needed to change, then updated or deleted them by id, one by one (there's a short sketch of this after the list).

2. Scaling up and down again can cause irrevocable performance changes since there's a direct link between the number of provisioned RUs and the number of "Physical Partitions" created by the database. A new "physical partition" is created for every 10K RUs or 50GB of data. CosmosDB knows how to create new physical partitions when scaling up, but doesn't know how to merge them when scaling down.

Say you have 10 logical partitions on 5 physical partitions, and you are paying for 50K RUs. Each physical partition holds exactly 2 logical partitions and is allocated 10K RUs. Now say you have to temporarily scale the database up to 100K RUs for some reason, so you end up with 10 physical partitions, one logical partition on each. When you scale back to 50K RUs, you'll still have 10 physical partitions, each now allocated only 5K RUs. So each of your logical partitions gets exactly 5K RUs, where before it had 10K RUs shared with one other logical partition. (The arithmetic is sketched in code after the list.)

3. The allocation of logical partitions to physical partitions is static and hash-based, and there's no control over it. This means hot logical partitions are a performance problem: they might end up on the same physical partition and be starved for resources while other physical partitions sit over-provisioned (there's a toy illustration of this after the list). Of course, you can allocate data to partitions completely randomly and hope for the best, but there's a performance penalty for querying across multiple logical partitions. Plus, updates/deletes are limited to a single logical partition, so you lose the ability to batch update/delete related documents.

4. Index construction is asynchronous and very slow, because it uses some pool of "free" RUs that scales with the RUs allocated to your collection. It used to take us over 12 hours to build simple indexes on a ~30GB collection. Also, if you issue multiple index modification commands, they are queued even if they cancel each other out. So issuing a "create index" command, realizing you've made a mistake, then issuing a "drop index" followed by another "create index" is a 24-hour adventure: over the next 12 hours the original index will be created, then immediately dropped, then created again. To top it off, there's no visibility into which indexes are actually being used, and the commands for checking the progress of index construction were broken and never worked for us.
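To make point 1 concrete, here is a minimal sketch of the delete workaround, assuming pymongo against the MongoDB API; the connection string, database/collection names and the tenant_id field are placeholders for illustration, not anything Cosmos-specific:

    import pymongo

    # Placeholder connection string and names -- adjust to your own setup.
    client = pymongo.MongoClient("mongodb://<account>.mongo.cosmos.azure.com:10255/?ssl=true")
    coll = client["mydb"]["events"]

    # Step 1: query only the _ids of the documents that need to go. The cursor
    # pages through results, so each round trip stays small.
    ids = [doc["_id"] for doc in coll.find({"tenant_id": "hot-tenant"}, {"_id": 1})]

    # Step 2: delete by _id, one document per request, so no single call ever
    # comes near the server-side 60-second timeout.
    for _id in ids:
        coll.delete_one({"_id": _id})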
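The partition arithmetic in point 2 is easy to lose track of, so here is a toy model of it. The 10K-RU split threshold is the one mentioned above, and the never-merge behaviour is the whole point; this is just the scenario in code, not how Cosmos actually allocates anything:

    # Toy model of the scale-up/scale-down asymmetry from point 2.
    RUS_PER_PHYSICAL_PARTITION = 10_000

    def physical_partitions(provisioned_rus, existing=0):
        # Cosmos splits so that no physical partition serves more than 10K RUs,
        # but it never merges partitions back together when you scale down.
        needed = -(-provisioned_rus // RUS_PER_PHYSICAL_PARTITION)  # ceil division
        return max(needed, existing)

    p = physical_partitions(50_000)               # 5 partitions, 10K RUs each
    p = physical_partitions(100_000, existing=p)  # split into 10 partitions
    p = physical_partitions(50_000, existing=p)   # still 10 partitions
    print(50_000 / p)                             # 5000.0 RUs per partition, forever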
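And for point 3, a toy illustration of why a static hash assignment hurts: nothing stops your two hottest logical partitions from landing on the same physical partition, and you can't move them. The hash below is a stand-in, not Cosmos's actual mapping; the point is only that the mapping is fixed and out of your hands:

    import hashlib

    # Stand-in for the static, hash-based mapping of logical partitions to
    # physical partitions. You don't control it and can't rebalance it.
    def physical_partition(partition_key, physical_count):
        digest = hashlib.md5(partition_key.encode()).hexdigest()
        return int(digest, 16) % physical_count

    # If "tenant-A" and "tenant-B" are your two hottest keys and they happen to
    # collide, one physical partition is starved while the others sit idle.
    for key in ["tenant-A", "tenant-B", "tenant-C", "tenant-D"]:
        print(key, "->", physical_partition(key, 5))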


This rhymes with my overall impression of Cosmos. It took us a while to see through the smokescreen, because when you talk to Microsoft support and representatives it is the Best Thing Ever and they sound so confident about it. But it really is a beta (if not alpha) demo product sold at a premium price tag.

If your traffic pattern is exactly right, and you always scale traffic up and never ever down and do not have spikes, I guess it is probably OK. The main problem is the docs are (or, at least were 2 years ago) not clear about all the caveats and restrictions but pretend it is a generic database that just works. So one has to discover all the caveats oneself.

Microsoft treats the exact workings of the partitioning as something that should work so well you don't need to know it in detail. But if your use case is slightly off, you end up really needing to know. I know at least one team that routinely copies all their data from one Cosmos instance to another and switches traffic over to the copy just to get a partitioning reset; it is one thing to have to do it, another to discover in production, with no prior warning, that it has to be done.

Also: the ipython+portal+Cosmos security meltdown from a year and a half ago should, on its own, be reason enough to look elsewhere.

(No, not a competitor, just have spent way way way too much engineering time moving first on and then off Cosmos and yes I am bitter)


Yes. The CPU and GPU demand has nothing to do with it. The reason is the car industry.

For some reason, in early 2020 all the car industry execs were convinced that people would buy dramatically fewer cars that year, due to the pandemic crashing demand. Because they have a religious aversion to holding any stock, they decided to shift the risk over to their suppliers, fucking said suppliers over, as the car industry normally does when it expects demand to shift. What made this particular time special, as opposed to business as usual, is that the car execs all got it wrong: people bought way more cars during the pandemic, not fewer, because they were moving out of cities and avoiding public transit. So they fucked over their suppliers a second time by demanding all those orders back.

Now, suppose you're a supplier of some sort of motor driver or power conversion chip (PMIC) in early 2020. You run 200 wafers per month through a fab running some early 2000s process. Half your yearly revenue is a customized part for a particular auto vendor. That vendor calls you up and tells you that they will not be paying you for any parts this year, and you can figure out what to do with them. You can't afford to run your production at half the revenue, so you're screwed. You call up your fab and ask if you can get out of that contract and pay a penalty for doing so, and you reduce your fab order to 100 wafers per month, so you can at least serve your other customers. The fab is annoyed but they put out an announcement that a slot is free, and another vendor making a PMIC for computer motherboards buys it, because they can use the extra capacity and expect increased demand for computers. So far so normal. One vendor screwed, but they'll manage, one fab slightly annoyed that they had to reduce throughput a tiny bit while they find a new buyer.

Then a few months later the car manufacturer calls you again and asks for their orders back, and more on top. You tell them to fuck off, because you can no longer manufacture it this year. They tell you they will pay literally anything because their production lines can't run without it because (for religious reasons) they have zero inventory buffers. So what do you do? You call up your fab and they say they can't help you, that slot is already gone. So you ask them to change which mask they use for the wafers you already have reserved, and instead of making your usual non-automotive products, you only make the customized chip for the automotive market. And then, because they screwed you over so badly, and you already lost lots of money and had to lay off staff due to the carmaker, you charge them 6x to 8x the price. All your other customers are now screwed, but you still come out barely ahead.

Now, of course the customer not only asked for their old orders back, but more. So you call up all the other customers of the fab you use and ask them if they're willing to trade their fab slots for money. Some do, causing a shortage of whatever they make as well. Repeat this same story for literally every chipmaker that makes anything used by a car. This was the situation in January 2021.

Then, several major fabs were destroyed (several in Texas, when the big freeze killed the air pumps keeping the cleanrooms sterile, and the water pipes in the walls of the buildings burst and contaminated other facilities, and one in Japan due to a fire), making the already bad problem worse. So there are several mechanisms that make part availability poor here:

1. The part you want is used in cars. Car manufacturers have locked in the following year or so of production, and "any amount extra you can make in that time" for a multiple of the normal price. Either you can't get the parts at all or you'll be paying a massive premium.

2. The part you want is not used in cars, but is made by someone who makes other parts on the same process that are used in cars. Your part has been deprioritized and will not be manufactured for months. Meanwhile stock runs out and those who hold any stock massively raise prices.

3. The part you want is not used in cars, and the manufacturer doesn't supply the car industry, but uses a process used by someone who does. Car IC suppliers have bought out their fab slots, so the part will not be manufactured for months.

4. The part you want is not used in cars, and doesn't share a process with parts that are. However, it's on the BOM of a popular product that uses such parts, and the manufacturer has seen what the market looks like and is stocking up for months ahead. Distributor inventory is therefore zero and new stock gets snapped up as soon as it shows up because a single missing part means you can't produce your product.

So here we are. Shameless plug - email me if you are screwed by this and need help getting your product re-engineered to the new reality. There's a handful of manufacturers, usually obscure companies in mainland China that only really sell to the domestic market, that are much less affected. Some have drop-in replacement parts for things that are out of stock, others have functionally similar parts that can be used with minor design adaptation. I've been doing that kind of redesign work for customers this whole year. Don't email me if you work in/for the car industry. You guys poisoned the well for all of us, so deal with it yourselves.

