Codeball – AI-powered code review (codeball.ai)
115 points by lladnar on May 27, 2022 | 53 comments



Explanation of results for non-ML folks (results on the default supabase repo shown on the homepage):

Codeball's precision is 0.99. That simply means that 99% of the PRs Codeball predicted to be approvable were actually approved. In layman's terms, if Codeball says a PR is approvable, you can be 99% sure that it is.

But recall is 48%, meaning that only 48% of the actually approved PRs were predicted to be approvable. So Codeball incorrectly flagged 52% of the approvable PRs as un-approvable, just to be safe.

So Codeball is like a strict bartender who only serves you when they are absolutely sure you're old enough. You may well be of age, but Codeball's not serving you.
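As a minimal sketch (purely hypothetical counts, roughly matching the 0.99/0.48 numbers above), precision and recall fall straight out of the confusion-matrix counts:

    # Hypothetical counts for illustration only (not Codeball's data)
    true_positives = 990    # predicted approvable, actually approved
    false_positives = 10    # predicted approvable, actually not approved
    false_negatives = 1070  # predicted un-approvable, actually approved

    precision = true_positives / (true_positives + false_positives)  # 0.99
    recall = true_positives / (true_positives + false_negatives)     # ~0.48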


A LOT of ML applications should be exactly like this.

I want systems with low recall that "flag" things but ultra ultra high precision. Many times, we get exactly the opposite - which is far worse!


Here's a visual explainer on Precision vs Recall (in the context of ML algorithms):

https://mlu-explain.github.io/precision-recall/


That’s still super useful.

I’m assuming most PRs are approvable. If that’s the case, then this should cut down the time spent doing reviews by a lot.


So basically, very few false positives but lots of false negatives is the tradeoff made by Codeball?


I'm a bit skeptical here. We should ask the question: why are we reviewing code in the first place? This sparks hot debates on HN every now and then, because reviews are not just automated checks but part of the engineering culture, which is a defining part of any company or eng department.

PR reviews are a way of learning from each other, keeping up with how the codebase evolves, sharing progress and ideas, giving feedback and asking questions. For example, at $job we approve ~90% of PRs, with various levels of pleas, suggestions, nitpicks and questions. We approve because of trust (each PR contains a demo video of a working feature or fix) and to avoid blocking each other, but there might be important feedback or suggestions given among the comments. A "rubber stamp bot" would be hard to train in such a review system and simply misses the point of what reviews are about.

What happens if there is a mistake (hidden y2k bomb, deployment issue, incident, regression, security bug, bad database migration, wrong config) in a PR that passes a human review? At a toxic company you get finger pointing, but with a healthy team, people can learn a lot when something bad passes a review. But you can't discuss anything with a nondeterministic review bot. There's no responsibility there.

Another question is the review culture. If this app is trained on some repo's history (whether PRs were approved or not), those past reviews reflect the review culture of the company. What happens when a black-box AI takes that over? Is it going to train itself on its own reviews? People and review culture can be changed, but a black-box AI is hard to change in a predictable way.

I'd rather set up code conventions, automated linters (i.e. deterministic checks) etc. than have a review bot allow code into production. Or just let go of PR reviews altogether, there were some articles shared on HN about that recently. :)


I agree with a lot of what you are saying here.

> why are we reviewing code in the first place?

It being part of engineering culture is spot on. I think of it as two things: (1) quality gate and (2) knowledge sharing. Because of (1), by default reviews can feel a bit like submitting homework - not all contributions are of the same risk level but they follow the same process.

The idea behind Codeball is unassuming - identify and approve the bulk of easy contributions so that devs can focus their energy reviewing the trickier ones. This can be especially nice in a trusting environment, keeping up the momentum for devs to ship small & often.

Another thing is - models can incorporate a surprising amount of indicators, for example, not just the outcome of the PR but also what happens to the contribution after merging (was the code retained as-is, or was it hotfixed a day later, etc.).


I def think 90% of the value in code review is knowledge sharing and general coding-practice discussion in good eng cultures. Tests should really catch glaring mistakes, and it's fairly rare that someone will say "this absolutely won't work, for these reasons you didn't catch."

If anything, I think code review as a "nothing bad will happen" check gives a really false sense of security, unless you have a super-strict, bus-factor, crazy-smart, kinda-asshole engineer on the team who is probably going to piss everyone off with strict code reviews that are mostly about personal preference but sometimes actually do catch the edge cases.


This could be really useful for large-scale changes across the company’s codebase that are usually reviewed by one high-level engineer who doesn’t know much about the code being changed, but the change is pushed through anyway because getting approval from all the owners would prevent the change from happening at all. In this case, automated code review makes more sense than it does for the more common, localized code changes.

But those large-scale changes are also usually systematic, so wouldn’t have much to do with coding conventions or styles.


I would never use something like this. Seems to me that it's just a heuristic based on the char diff count. I made a simple repo that has a shell script that does rm -rf /usr/old_files_and_stuff, added a space next to the first slash and it was approved, which is dangerous. If I need to manually verify it anyways for stuff like this, why would I use it?


I generally feel the same way, but just to steel man the argument: would your manual code review process have caught this issue?

Sometimes we compare new things against their hypothetical ideal rather than the status quo. The latter is significantly more tractable.


> would your manual code review process have caught this issue?

On a one character code change? I’m inclined to think so.


I think I like this better expressed as a linter than a code reviewer. Maybe it doesn't sell as well. But giving this to devs to help them make better PRs and have more confidence in approval? Good. Skipping code review? Bad.

In my experience, most "issues" in code review are not technical errors but business-logic errors, and most of the time there isn't even enough context in the code to know what the right answer is. It lives in a PM's or salesperson's head.


Your comment changed my mind about this. I still don’t want it to do reviews for me, but now I do want it to pre-review my changes before I create a PR so that I can avoid unnecessary review cycles from obvious mistakes.


Agreed. As it is, it's a great pre-step.

Skipping code review is depriving the team of an opportunity to learn about the new incoming change, and depriving them of sharing knowledge (better implementations, business context, etc.)


Would be great to do something similar for journal reviews too!


Creator of Codeball here, somebody beat us to sharing it :).

Codeball is the result of a hack week at Sturdy - we were thinking about ways to reduce the waiting-for-code-review time and were curious exactly how predictable the entire process is. It turned out to be very predictable!

Happy to answer any questions.


I know for a fact I would not want to automate many of the predictable aspects of code reviews at any job I've ever had. This is because many of the predictable aspects of code review are due to poor review practices. Things like rubber-stamping review requests with a certain change size (e.g. lines of code, number of files), surface-level reviews (e.g. "yep this doesn't violate our style guidelines that the linter can't catch"), and similar items.

A proper code review isn't simply catching API or style errors--it seeks to understand how the change affects the architecture and structure of the existing code. I'm sure AI can help with that, and for a broad class of changes it's likely somewhat to very predictable--but I'm skeptical that it is predictable for enough use cases to make it worth spending money on (for now), say.

Put another way: "approves code reviews a human would have approved" isn't exactly the standard I'd want automated reviews to aspire to. Human approval, in my experience, mostly doesn't mean a good-quality review happened.


Maybe the AI approach is still useful. I am thinking of analysing the AST to measure the impact of a code change, or the complexity of the various components of the project - some kind of graph analysis to measure complexity and maintainability at the project level.
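For instance, a crude sketch with Python's built-in ast module (just a branch count as a complexity proxy - illustrative, not anything Codeball is known to do):

    import ast

    def branch_count(source: str) -> int:
        # Crude complexity proxy: count branching constructs in the AST
        tree = ast.parse(source)
        branch_nodes = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)
        return sum(isinstance(node, branch_nodes) for node in ast.walk(tree))

    # e.g. compare before/after a change:
    # delta = branch_count(new_source) - branch_count(old_source)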


My thesis is that a tool like Codeball would reduce the amount of rubber-stamping. Thing is, many devs aim to ship code "small and often", which inevitably leads to fast, pattern-matching-style reviews. If software can reliably deal with those, humans can focus their energy on the tricky ones. Kind of like how using a code formatter eliminates that type of discussion and lets people focus on semantics.


How much will you be charging for the adversarial network to allow someone to get any PR approved? ;)


Hi. I tried creating the same service about 5 years back ;) articoder.com ;) I was digging at it for a few months, but the natural language processing of the time was not up to the task...

Good to know that now it is doable in a week, with such good precision! Or do you have humans in the backend ;) ?

How do you compare yourself to PullRequest (they'd been digging at it for 5 years as well and recently folded)? [Funny fact: we were interviewed in the same YC batch, which always makes me wonder if YC liked the idea enough to have it implemented by another team ;) ]


It's really cool to hear that others have thought about this too!

>How do you compare yourself to PullRequest

So it turns out that most code contributions nowadays get merged without fixes or feedback during the review (about 2/3). I think this is because of the increased focus on continuous delivery and shipping small & often. Codeball's purpose is to identify and approve those 'easy' PRs so humans get to deal with the trickier ones. The cool part is being less blocked.


Is your model trained per language?

Without something that semantically understands the code under review (which all but requires general AI, or at the least a strong static analyzer), I don't see this doing anything more than adding noise to the process, or worse, leading to certain groups of developers effectively being given a free pass.


It is not trained per language, but it has 2 things up its sleeve: it considers the author's past experience in the context of the files being changed, as well as whether similar code changes (via perceptual hashes) are associated with objections or fixes.


Both of those statements convince me even more that this is a bad idea. While the author's past experience is important, it has little bearing on the current PR. Same for similar code changes. In code review, the skill/history of the developer is only really relevant when writing comments. You should look for the same potential mistakes and logic errors in a senior developer's PR as in a junior developer's. Adding the developer's experience as an input could easily lead to the model deferring to experience. In my mind, that makes the signal this is providing potentially harmful, not helpful.


> we were thinking about ways to reduce the waiting-for-code-review

Code reviews should be an interrupt for everything except downtime mitigation.

Reviewing your peer's code quickly will cause them to do the same to you. It is a virtuous circle.

Be the change you want to see.


For open contribution projects, you can't use this because it's trivial to slip in something malicious.

For projects with trusted contributors only, PRs are usually approvable anyway, so a one-bit black-box signal telling you that some of them are (with zero explanation, it seems?) isn't very valuable.

Not sure why you would use this.


I can't tell if this is a joke or not


    import time

    def review(diff):
        if len(diff) > 500:  # lines changed in the PR
            return "Looks good to me"
        time.sleep(86400)    # otherwise, wait a day...
        return "+1"


It is definitely not a joke. This started off as scratching our own itch in answering 'how predictable are programmers?' but it turned out to be really useful, so we made a site.


Good to know; is there an example (other than its own GH Action repo) to see what it has approved?

Given that it's a model, is there a feedback mechanism through which one could advise it (or you) of false positives?

I would be thrilled to see what it would have said about: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/76318 (q.v. https://news.ycombinator.com/item?id=30872415)


I've tried to reproduce #76318 as best as I could (using a fork of the CE version of GitLab). https://github.com/zegl/gitlabhq-cve-test/pull/1

Codeball did not approve the PR! https://codeball.ai/prediction/8cc54ce2-9f50-4e5c-9a16-3bc48...


It would have said nothing. The model's idea is to identify the bulk of easy / safe contributions that get approved without objection and let the humans focus on the tricky ones (like the example above).

On the site you can give it GitHub repos, where it will test the last 50 PRs and show you what it would have done (false negatives and false positives included). You can also give it a link to an individual PR, but GitLab is not yet supported.


>Codeball uses a Multi-layer Perceptron classifier neural network as its prediction model. The model takes hundreds of inputs in its input layer, has two hidden layers and a single output scoring the likelihood a Pull Request would be approved.

Really bringing out the big guns here!
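For reference, the quoted architecture maps onto a few lines of scikit-learn. This is only a sketch of the general shape - the layer sizes and toy data below are made up, not Codeball's actual model:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Toy stand-ins: one row per historical PR, a few hundred engineered features,
    # label 1 if the PR was approved without objections, else 0
    X_train = np.random.rand(500, 200)
    y_train = np.random.randint(0, 2, size=500)

    model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300)  # two hidden layers
    model.fit(X_train, y_train)

    # Single output: estimated probability that a new PR would be approved
    new_pr = np.random.rand(1, 200)
    p_approve = model.predict_proba(new_pr)[0, 1]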


I would be a bit concerned about adversarial attacks with this. I’m sure someone will be able to come up with an innocent looking PR that the system will always approve, but actually is malicious. Then any repo which auto-approves PRs with this could be vulnerable.


There are 3 categories of predictors that the model takes into account; here are some examples: (1) the code complexity and its perceptual hash, (2) the author and their track record in the repository, (3) the author's past involvement in the specific files being modified.

With that said, an adversarial attack from somebody within the team/organisation would be very difficult to detect.


Looks awesome.

Tone down the marketing page :) This page makes it sound like a non-serious person built the tool.

How about: "Codeball approves Pull Requests that a human would approve. Reduce waiting for reviews, save time and money."

And make the download button: "Download"


This is an exciting direction for AI code tools! I'm curious to see code review tools that give developers feedback on non-approved code, which I think is an important purpose of code review: building a shared understanding of technical standards.

On a related note, I'm working on https://denigma.app, which is an AI that tries to explain code, giving a second opinion on what it looks like it does. One company said they found it useful for code review. Maybe the clarity of an AI explanation is a decent metric of code quality.


This is a neat idea but gives me pause. Thinking about how it would work in projects I maintain, it would either:

- be over-confident, providing negative value because the proportion of PRs which “LGTM” is extraordinarily low, and my increasingly deep familiarity with the code and areas of risk makes me even more suspicious when something looks that safe

- never gain confidence in any PR, providing no value

I can’t think of a scenario where I’d use this for these projects. But I can certainly imagine it in the abstract, under circumstances where baseline safety of changes is much higher.


So I dry ran it against a tiny open source repo I maintain and it worked on exactly 0 of the last 50 PRs. For example, it didn’t auto-approve a PR that was just deleting stale documentation files... The idea sounds nice, but the execution is a bit lacking right now.


I don't really get the point of it either, since it just approves PRs. I know when my PR is mergeable, you don't have to tell me that. What I need is some feedback since that's what code review is for.

Any linter is more useful than this.


Probably I'm misled, but how is it a code review without looking at the actual code? (The code isn't listed as an input feature on the 'how' page.)


It does look at the code at a meta level, in particular whether the kind of change in the PR has previously been objected to or corrected afterwards. It creates perceptual hashes of the code changes, which are used as categorical variables that go into the neural net.

Deriving features about the code contributions is probably the most challenging aspect of the project so far.
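To give a flavour of the idea, here is a simhash-style hash over diff tokens - this particular sketch is only illustrative, not our exact feature engineering:

    import hashlib
    import re

    def perceptual_hash(diff_text: str, bits: int = 64) -> int:
        # Simhash-style: similar diffs end up at small Hamming distances
        weights = [0] * bits
        for token in re.findall(r"[A-Za-z_]\w+", diff_text):
            h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
            for i in range(bits):
                weights[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i, w in enumerate(weights) if w > 0)

Two near-identical changes (say, the same rename applied in two files) then hash close to each other, which is roughly the property you want before bucketing changes into categories.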


"Download free money" sounds like a scam.


It's just poking fun at other websites being too serious.


Is this just (not) approving, or is it actually providing automated feedback on what needs to be fixed, plus suggestions?


It is like a first-line reviewer. It approves contributions that it is really confident are good and leaves the rest to humans. So basically it saves developers time and context switching.


Is there no marker that can be provided to indicate why it failed or even a line number?

Can't tell if it's something like formatting and code style or "bad code" or what. Even as a first line reviewer I can't tell if this is valuable or not without any details on why it would approve something.

The PRs it would approve here were all super minor. Could probably get a similar number of these approved just by checking lines of code changed + "has it been linted".

It's really hard to tell if this is valuable or not yet.


You are making a very good point. Right now it can't give such an indication because it is a black-box model. There are hundreds of inputs that go in (e.g. characteristics of the code, how much the author has worked with this code in the past, how frequently this part of the code changes) and the output is how confident the model is that the contribution is safe to merge.

With that said, there are ways of exposing more details to developers. For example, scoring is done per-file, and Codeball can tell you which files it was not confident in.
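As a rough illustration of that kind of per-file roll-up (hypothetical scores and threshold, not our internals):

    # Hypothetical per-file confidence scores from the model
    file_scores = {
        "src/api/handlers.py": 0.97,
        "src/db/migrations/0042_add_index.py": 0.41,
        "README.md": 0.99,
    }

    THRESHOLD = 0.95
    unsure_files = [f for f, score in file_scores.items() if score < THRESHOLD]

    if unsure_files:
        print("Needs a human look at:", ", ".join(unsure_files))
    else:
        print("All files above threshold - safe to approve")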


Would love to see a FOSS fork of this; there's no other way we can use it at the $MAGMOA day job.

I can feel it.. it wants to be free!


Bravery test: do you have a publicly accessible repo that uses this to accept pull requests?


Just curious: why not add the ability to review the existing source code, and not just pull requests?



