No idea why "story points" or "cups of coffee" or "shirt sizes" have much relation to time. I mean... I get it, but... many places I've worked also go to pains to say "this isn't hours, we're just estimating relative complexity". But plenty of issues are extremely complex, but may only take a few days, vs some items which are less complex, but larger (touching a number of files, or screens, etc).
With that rubric, 8 story points might take 3-4 days of focused concentration, and 3 story points might take 5-6 days of less focused but more brute-force work. Nowhere I've worked accepts that as legitimate; they all want to redefine the language into something that approximates time. So... why not just estimate days or hours anyway?
Both an estimate of "large shirt" and "30 hours" can have 'explicit conditions built in'. This will be 30 hours with my current understanding of the request. If that understanding changes, 30 hours will change. I don't think you need 'shirts' for that?
I can easily make commitments if the people I'm committing to are fine with a change in the dates. That's a big 'if', and not one that plays out positively most times.
This is the rub, because most places I've been at, the commitment ends up being treated as a deadline, because... that seems to be how people work. "Dec 15 with 90% confidence" becomes "Dec 15", and other parties start making plans and decisions based on "Dec 15" without any consultation or being looped into the process. When 'Dec 15' has to become 'Jan 10', many, many people are impacted and generally upset.
Because humans are extremely bad at estimating time, which is borne out by studies many times over. A good overview is the original research on this, for which Kahneman et al. won a Nobel prize (yes, the same Kahneman who would later go on to write HN favorite "Thinking, Fast and Slow"). The broad stroke is, the very best time estimators in the very best circumstances only underestimate their time needs by 33%. The norm is more like 80%. They propose a few time estimation strategies to get around it, like "third party estimation" and "tripartite estimation". But the simplest approach (which emerged in later research) is to ask people to estimate the "size" of a task, and use statistical correlation to convert that to a number.
This last is hand-wavy unless you're familiar with the law of large numbers, the law that makes casinos profitable. A casino cannot (without cheating) determine the outcome of a single roulette spin. But it can predict with extremely high certainty the aggregate outcome of a thousand spins. It's the same with your estimates. You can't predict the correlation to time of a single story point. As you pointed out, sometimes something that looked complicated turns out to be easy and vice versa. But given a sufficient sample size (of estimates with a consistent correlation to time), you can predict with extreme accuracy the time for 1000 story points.
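To make the casino analogy concrete, here's a rough simulation sketch. Everything in it is made up for illustration: the 6 hours per point average, the lognormal noise, and the spread are assumptions, not measured numbers. The only point is that the relative error on one task is huge while the relative error on a big batch is small.

```python
import random
import statistics

HOURS_PER_POINT = 6.0  # hypothetical long-run average, not a real benchmark
SIGMA = 0.8            # hypothetical spread of per-task noise (lognormal)


def actual_hours(points, rng):
    """Simulated real duration: the average rate times a noisy multiplier."""
    # Choose mu so the multiplier has mean 1.0 (lognormal mean = exp(mu + sigma^2 / 2)).
    mu = -SIGMA ** 2 / 2
    return points * HOURS_PER_POINT * rng.lognormvariate(mu, SIGMA)


def mean_relative_error(n_tasks, trials=2000, seed=1):
    """Average |actual - predicted| / predicted for a batch of n_tasks tasks."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        points = [rng.choice([1, 2, 3, 5, 8]) for _ in range(n_tasks)]
        predicted = sum(points) * HOURS_PER_POINT
        actual = sum(actual_hours(p, rng) for p in points)
        errors.append(abs(actual - predicted) / predicted)
    return statistics.mean(errors)


single_task_error = mean_relative_error(1)
big_batch_error = mean_relative_error(200)
print(f"one task:  off by about {single_task_error:.0%} on average")
print(f"200 tasks: off by about {big_batch_error:.0%} on average")
```

The individual tasks are just as unpredictable in both runs; only the aggregate gets tight. Real task durations won't follow this exact distribution, so treat the specific percentages as illustrative only.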
"Consistent correlation to time" is a bit of a PITA in a group, BTW. If you have developers do their own estimations individually, each one will have a different correlation to time. You would need a very large sample size to overcome that much variation. This is why so many systems encourage team estimation, so the consistency depends on the team dynamic, which is much more stable even when adding/removing engineers. But as I said, if it's the same person or team always writing your tasks, you can use their team dynamic instead, since their story size will be consistent.
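A quick sketch of why individual estimates don't mix well. The names and numbers are invented for illustration, and "hours per point" here is just total hours divided by total points, which is one simple way to get the conversion factor, not the only one.

```python
def hours_per_point(history):
    """Conversion factor from (points, actual_hours) pairs: total hours / total points."""
    total_points = sum(points for points, _ in history)
    total_hours = sum(hours for _, hours in history)
    return total_hours / total_points


# Hypothetical history: each person estimated and completed their own tasks.
individual_history = {
    "alice": [(3, 12), (5, 20), (8, 32)],
    "bob": [(3, 27), (5, 45), (2, 18)],
}

# Hypothetical history where the whole team agreed on each estimate.
team_history = [(3, 18), (5, 31), (8, 47), (2, 12), (5, 30), (3, 18)]

individual_factors = {
    name: hours_per_point(history) for name, history in individual_history.items()
}
team_factor = hours_per_point(team_history)

print(individual_factors)  # alice's "point" and bob's "point" mean different things
print(team_factor)
```

In this made-up data a point from one person is worth more than twice as many hours as a point from the other, so pooling their estimates adds a lot of noise. A single team-level factor avoids that, as long as the team keeps estimating together.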
FWIW by sufficient sample size, I mean after about 3 sprints (of any duration) you can make reasonable predictions. After 6 sprints you'll have confusing outliers, and after about 9 sprints it will be clear with some numerical weight to it.
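For what "reasonable predictions" can look like in practice, here's a minimal sketch that turns a few sprints of history into a forecast range. The velocities and backlog size are invented, and the plus or minus one standard deviation band is my own crude assumption for showing uncertainty, not a rigorous confidence interval.

```python
import math
import statistics


def forecast_sprints(velocities, backlog_points):
    """Rough sprint-count range from past velocities (points completed per sprint)."""
    mean_velocity = statistics.mean(velocities)
    spread = statistics.stdev(velocities)  # needs at least two sprints of history
    fast = mean_velocity + spread
    slow = max(mean_velocity - spread, 1e-9)  # guard against dividing by zero
    return {
        "optimistic": math.ceil(backlog_points / fast),
        "likely": math.ceil(backlog_points / mean_velocity),
        "pessimistic": math.ceil(backlog_points / slow),
    }


# Hypothetical: six completed sprints, 200 points left in the backlog.
history = [21, 26, 19, 24, 23, 25]
forecast = forecast_sprints(history, 200)
print(forecast)
```

The useful output is the range, not the middle number. "Somewhere between 8 and 10 sprints" is the thing to communicate, and the range should narrow as more sprints accumulate.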
Which brings up question 2, "the commitment ends up being a deadline". This is a human nature thing, you're right! But the problem isn't a mismatch between human nature and your estimate. The mismatch is between human nature and the uncertainty of reality. How you push to improve this is contextual to your org. In hard situations I reverse the statement of my estimate, to "if we set Dec 15 as the deadline, there's a 5% chance we won't make it. What's our fallback?" Asking that question a lot is helpful. But there's no magic bullet for making leadership - or worse, people who are afraid of leadership - plan appropriately for uncertainty. The best you can do is expose the uncertainty as clearly as possible, and give lots of lead time for the times when they still run into conflict between deadline, resources, and scope. After that, it's the manager's job to "manage" things and decide which variable they will alter to break the conflict.
Put another way: reality is uncertain. When that uncertainty leads to a conflict between deadline, scope, and available resources - because that will happen sometimes per point 1 - only someone with deadline, scope, or hiring authority can solve it. That's (usually) not within your purview as a lead engineer. The best you can do is to 1) call out the uncertainty as clearly as you can, as early as you can, and 2) signal that conflict as early as you can, so those managers have maximum leeway. Abstracted estimation makes that possible. Guesses and hopes don't.