[flagged] Show HN: Weave - actually measure engineering productivity (workweave.ai)
22 points by adchurch 71 days ago | 39 comments
Hey HN,

We’re building Weave: an ML-powered tool that measures engineering output and actually understands it!

Why? Here’s the thing: almost every eng leader already measures output - either openly or behind closed doors. But they rely on metrics like lines of code (correlation with effort: ~0.3), number of PRs, or story points (slightly better at ~0.35). These metrics are, frankly, terrible proxies for productivity.

We’ve developed a custom model that analyzes code and its impact directly, with a far better 0.94 correlation. The result? A standardized engineering output metric that doesn’t reward vanity. Even better, you can benchmark your team’s output against peers while keeping everything private.
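
To make those correlation figures concrete: each number is just the Pearson correlation between a candidate metric and manually labelled effort across a set of PRs. A minimal sketch (the numbers below are made up for illustration, not our evaluation data):

    # Illustrative only: how a "correlation with effort" figure is computed.
    import numpy as np

    # Manually labelled effort for a handful of PRs, in hours.
    labelled_effort = np.array([2.0, 5.0, 1.0, 16.0, 8.0, 0.5])

    # A naive candidate metric: lines of code changed in each PR.
    lines_changed = np.array([300, 150, 900, 700, 200, 40])

    # A hypothetical model score for the same PRs.
    model_score = np.array([2.3, 4.1, 1.4, 14.8, 7.2, 0.9])

    def pearson(a, b):
        # Pearson correlation coefficient between two 1-D arrays.
        return float(np.corrcoef(a, b)[0, 1])

    print("LOC vs effort:  ", round(pearson(lines_changed, labelled_effort), 2))
    print("model vs effort:", round(pearson(model_score, labelled_effort), 2))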

Although this one metric is much better than anything else out there, of course it still doesn't tell the whole story. In the future, we’ll build more metrics that go deeper into things like code quality and technical leadership. And we'll build actionable suggestions on top of all of it to help teams improve and track progress.

After testing with several startups, the feedback has been fantastic, so we’re opening it up today. Connect your GitHub and see what Weave can tell you: https://app.workweave.ai/welcome.

I’ll be around all day to chat, answer questions, or take a beating. Fire away!




"Hello Jane, please have a seat. We need to talk about your productivity. Yes, I know you helped the team through a crunch and delivered the new feature, which works flawlessly and is loved by our users. And our balance sheet is much healthier after you found that optimization that saves us $1mm/year. We also appreciate that younger teammates look to you for guidance and learn a lot from you.

But you see, the AI scored your productivity at 47%, barely "meets expectations", while we expect everyone to score at least 72%, "exceeds expectations". How is that calculated? The AI is a state of the art proprietary model, I don't know the details...

Anyways, we've got to design a Personal Improvement Plan for you. Here's what our AI recommends. We'll start with the TPS reports..."


LOL - I shudder at the idea of a manager making HR decisions based solely on this one metric!

To be clear, we're not claiming this is one number that holistically evaluates an entire engineer. Rather, we're giving a much more accurate picture of output, which most orgs are already measuring (with terrible accuracy). It should be an important part of the picture, but certainly not the whole story!

And fwiw I think the scores are pretty transparent - in the platform, you can drill down into any number and see the actual PRs and their output measurements. Of course the underlying model is more complex but unfortunately simpler models are not sufficient to capture the way engineering output works.


Shudder is a fair reaction. Except the manager is now forced by HR to make an HR decision, because that way top-level "leadership" wouldn't be culpable!


How did you come up with those magic correlation numbers?

Is this generally just sniffing surface quality and quantity of written code, or is consideration given to how architecturally sound the system is built, whether the features introduced and their implementations make sense, how that power is exposed to users and whether the UI is approachable and efficient, user-feedback resulting from the effort, long-term sustainability and technical debt left behind (inadvertently or with deliberation), healthy practices for things like passwords & sensitive data, etc?

I'm glad to see an effort at capturing better metrics, but my own feeling is that trying to precisely measure developer productivity is like trying to measure IQ - it's a fool's errand, and all you wind up capturing is one corner of a larger picture. Your website shares zero information prior to login, so I'm looking forward to you elaborating a little more on your offering!

EDIT: Would also love to hear feedback from developers at the startups you tested at - did they like it and feel it better reflected their efforts during periods when they felt productive vs. not, was there any initial or ongoing resistance & skepticism, did it make managers more aware of factors not traditionally captured by the alternative metrics you mentioned, etc.


> How did you come up with those magic correlation numbers?

Evaluated on a proprietary data set of manually labelled PRs

> Is this generally just sniffing surface quality and quantity of written code...

Somewhere in between the two :) a PR with a poorly and quickly implemented login will have a lower output score than a PR with a robust, well-designed and tested login, simply because the latter is more effort. But there isn't (yet!) a metric to quantify the relative quality. So our metric doesn't tell the full story, but it gives more info than would have previously been available.


A hotfix PR that was 2 lines long, prevented (or managed) an outage incident, but was merged with no title, description, or review because time was of the essence.

Some of such changes have been my most impactful ones.


Our metric is approximately "hours of work for an expert engineer." Here are some example open source PRs and their output metrics calculated by our algorithm:

https://github.com/PostHog/posthog/pull/25056: 15.266 (Adds backend, frontend, and tests for a new feature)

https://github.com/microsoft/vscode/pull/222315: 8.401 (Refactors code to use a new service and adds new tests)

https://github.com/facebook/react/pull/27977: 5.787 (Small change with extensive, high effort tests; approximately 1 day of work for expert engineer)

https://github.com/microsoft/vscode/pull/213262: 1.06 (Mostly straightforward refactor; well under 1 day of work)
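
If you want to poke at the raw signals for any of these yourself, the GitHub REST API exposes a PR's basic size and churn stats - the kind of inputs a scoring model could start from. A rough sketch (not our actual pipeline):

    # Fetch basic size/churn stats for a public PR via the GitHub REST API.
    # Only a sketch of the raw inputs such a model could look at,
    # not the Weave scoring pipeline.
    import requests

    def pr_stats(owner: str, repo: str, number: int) -> dict:
        url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}"
        resp = requests.get(url, headers={"Accept": "application/vnd.github+json"})
        resp.raise_for_status()
        pr = resp.json()
        return {
            "title": pr["title"],
            "additions": pr["additions"],
            "deletions": pr["deletions"],
            "changed_files": pr["changed_files"],
            "commits": pr["commits"],
        }

    print(pr_stats("PostHog", "posthog", 25056))
    print(pr_stats("facebook", "react", 27977))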


Curious how these numbers correlate to the estimates of the engineers behind the PRs?

For example, the first PR is correlated with ~15 "hours of work for an expert engineer"

Looking at the PR, it was opened on Sept 18th and merged on Oct 2nd. That's two weeks, or 10 working days, later.

Between the initial code, the follow up PR feedback, and merging with upstream (8 times), I would wager that this took longer than 15 hours of work on the part of the author.

It doesn't _really_ matter, as long as the metrics are proportional, but it may be better to refer to them as isolated complexity hours, as context-switching doesn't seem to be properly accounted for.


Yeah maybe "expert engineer" is the wrong framing and it should be "oracle engineer" instead - you're right that we're not accounting for context switching (which, to be fair, is not really productive right?)

However ultimately the meaning isn't the absolute number but rather the relative difference (e.g. from PR to PR, or from team to team) - that's why we show industry benchmarks and make it easy to compare across teams!


That assumes all or almost all the work is writing the code, with no time allotted to actually using the app with that code written, benchmarking or other measurements, research about possible alternatives, etc.


Not at all! The algorithm is calibrated with real human effort. So find/replacing something 1000 times will have nowhere near the same value as adding 1000 lines of new code. And if you implement the same functionality in 100 lines instead of 1000, you'll get the same value.

What we don't capture is any product or communication overhead - however our platform has other metrics which can help find if these are causing inefficiencies :)


In a complex, mature system, a high-impact bug can have a very small fix that is highly non-obvious. Your metric assumes that the person shitting out 1000 lines of a new feature no one wants is as productive as a distributed systems wizard who can fix bugs no one else can figure out, adding a 3-line fix for an issue that customers have been complaining about for years. It is inherently biased towards adding new features and against maintenance and system quality improvement.


If you build something that doesn't solve problems with impact to the business, your real productivity is zero. How does this account for that?

https://blog.pragmaticengineer.com/the-product-minded-engine...


Agreed. Focusing on outputs leads to busywork and doesn't necessarily mean better productivity, just more busyness. "We generated all this output over the past sprint! Never mind that the outcome with actual new business value only required a few hours or a day to deliver... but we managed to take 5 times longer to get it to the customer!" I'd rather engineers slow down and think before producing noise just to show they were thinking. I see knowledge-work management and business leadership falling back on vanity metrics again.


You're totally right! And we do not account for it at all. We don't give any insights into product decisions, just engineering output. If engineering output is high but business results are bad, that indicates there's a problem somewhere else in the business. But it's still good to know that engineering output is high!


As soon as people know how the metric is calculated, they will game that metric and it will cease to be useful.


This metric approximately measures functionality added, so the way to game it would be to add lots more functionality.

This metric has no opinion on the nature of that functionality (i.e. it's not evaluating product decisions). So it doesn't tell the whole story, but it tells a much more accurate story than LOC or whatever other metrics people are using currently!


You are describing a feature factory where people ship features just to ship features. Shitloads of code gets shipped and yet there's no real forward momentum because none of it has any impact. It is a bad place to be.


Wouldn't this be considered a very high performing engineering team with a very poor product team? We're only evaluating engineering - not the product decisions behind that engineering.


Sure, a siloed organization. Some folks care about the whole process, not just the work in front of them. In fact, one of the biggest demotivators and morale-destroying things is doing work that isn't meaningful (assuming the person isn't just in it to make money for themselves, which would make the work meaningful to them; some of us don't care to be mercenaries...).


And that's why runners don't track their race times.


If the metric is COINCIDENT with the goal, then it's not a problem. You can't get better at winning races without improving your race time (well... you can degrade everyone else's times I suppose).

But if the metric is a PROXY for the goal, then the metric becomes the objective (not the actual goal).

In this case, whatever this AI is measuring is going to be exactly what every dev drops into their Github Copilot prompt instructions.


Making a metric an objective is an effective if often very costly way of testing the hypothesis that it is coincident rather than a proxy (which in cases more complex than racing is frequently a matter of dispute.)


Not all runners are competitive runners. I track my time as a measure of progress among many other measurements, but my objective is to feel better and live a saner life.

Metrics are harder on software engineers: the good ones delete code, and the best ones make sure useless code never gets written in the first place. How do you measure that?


> Metrics are harder on software engineers

This is exactly the problem we're hoping to solve :')

> the good ones delete code

Our algorithm gives lots of credit for deleting old code!

> the best ones make sure useless code never gets written in the first place

This we don't really capture yet - but our hope is to eventually give insights into this vital work as well.


Let me just ignore my natural disdain for the whole thing (as an engineer and a manager).

> We’ve developed a custom model that analyzes code and its impact directly...

This is a bold claim, all things considered. Don't you need to fine-tune this model for every customer, since their business metrics are likely vastly different? How do you measure the impact of refactoring? What about regressions or design mistakes that surface only after months or even years?


Different companies will have different outputs just by the nature of their stage & situation, but the numbers are still relatively comparable (e.g. across different teams).

> How do you measure the impact of refactoring?

The metric def gives credit for refactoring

> What about regressions or design mistakes that surface only after months or even years?

Not captured (part of why it's only an important part of the story, not the whole story :))


> Let me just ignore my natural disdain for the whole thing (as an engineer and a manager)

I totally get this - that's how I felt initially, but I was shocked to find that the vast majority of orgs are using bad metrics like LOC or commit counts anyway. Our belief is that replacing those with something much more accurate can help the entire industry.


I worked as a manager at a big tech company that used metrics such as the number of PRs merged, the number of PR reviews, etc. At least with those, when questions came up based on the metrics, I was able to contextualize and explain to upper management why a particular engineer showed a dip even when their performance was stellar.

How am I going to do that with your blackbox metrics when this need arises?

Also, I don't have a Google account, so I can't even get past your front page, which has no info.


I'm looking forward to developers setting up LLM prompts to make their code seem more complex and like it required more effort.


What do you see as the major threats to validity for your approach?


Pretty dumb to think you can infer effort from the code itself. You make one "smart invocation" to a remote microservice and replace 1000 lines of code!

The information for effort is not available at the code level - sorry to burst your bubble.


Our algorithm gives lots of credit for removing old lines of code!

But to your broader point, I think there certainly is information about effort at the code level. Consider for example these two PRs: https://github.com/PostHog/posthog/pull/23858 and https://github.com/microsoft/vscode/pull/209557. It's pretty easy to tell which one was more effort even if you don't know anything about the process for how they were implemented.

Do you have any shareable examples you want me to test out? Or of course you can try it yourself :)


Hey HN! I'm one of the co-founders of Weave, and I wanted to jump in here to share a bit more.

Building this has been a wild ride. The challenge of measuring engineering output in a way that’s fair and useful is something we’ve thought deeply about—especially because so many of the existing metrics feel fundamentally broken.

The 0.94 correlation is based on rigorous validation with several teams (happy to dive into the details if anyone’s curious). We’re also really mindful that even the best metrics only tell part of the story—this is why our focus is on building a broader set of signals and actionable insights as the next step.

Would love to hear your thoughts, feedback, or even skepticism—it’s all helpful as we keep refining the product.


Skeptic here. How can you validate the difference in effort from a startup where growth happens in explosive moments with many rewrites in between vs a refined enterprise codebase with incremental changes? Is it productive if I have tried many changes in branches and none of them made it to prod?


Startups will naturally have higher output than enterprises for this reason - we'll show people benchmarks accordingly.

> Is it productive if I have tried many changes in branches and none of them made it to prod?

Our metric measures displacement, not distance - under the assumption that the end state is the part that matters the most. It will notice if the resulting change has a higher cognitive load and evaluate it accordingly - but if there is no resulting change then ultimately there's no output to measure.
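
Concretely, "displacement" here means the net diff that actually lands, while "distance" would be the sum of all the intermediate churn along the way. A rough way to see the difference for a local branch (an illustration of the distinction, not our implementation; it assumes a git checkout with a "main" base branch):

    # Illustrative: "distance" (total churn across all commits on a branch)
    # vs "displacement" (the net diff against the base branch).
    import subprocess

    def run(*args: str) -> str:
        return subprocess.run(["git", *args], capture_output=True, text=True, check=True).stdout

    def sum_numstat(output: str) -> int:
        # Each numstat line is "added<TAB>deleted<TAB>path"; binary files show "-".
        total = 0
        for line in output.splitlines():
            parts = line.split("\t")
            if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
                total += int(parts[0]) + int(parts[1])
        return total

    def churn(base: str = "main") -> int:
        # "Distance": lines added+deleted across every commit on this branch.
        return sum_numstat(run("log", "--numstat", "--format=", f"{base}..HEAD"))

    def net_change(base: str = "main") -> int:
        # "Displacement": lines added+deleted in the net diff against the base.
        return sum_numstat(run("diff", "--numstat", f"{base}...HEAD"))

    print("distance (total churn): ", churn())
    print("displacement (net diff):", net_change())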


I'd like to add that you need way more information on the landing page before I'm going to do much more than let you have my email address (if that). Right now it's a black box that takes in data(?) and spits out... something?


Check out our main landing page: https://workweave.dev/

Let me know if you have any questions that aren't answered there!


I just want to inform you that the pricing section is effed up. It talks about FramerBite pricing - which I guess is the thing you used to throw this landing page together. That seems very low effort and I would estimate the output metric of that to be 1.03 with a correlation of 0.96.



