Show HN: Boldly go where Gradient Descent has never gone before with DiscoGrad (github.com/discograd)
232 points by frankling_ 11 months ago | 66 comments
Trying to do gradient descent using automatic differentiation over branchy programs? Or to combine them with neural networks for end-to-end training? Then this might be interesting to you.

We developed DiscoGrad, a tool for automatic differentiation through C++ programs involving input-dependent control flow (e.g., "if (f(x) < c) { ... }", differentiating wrt. x) and randomness. Our initial motivation was to enable the use of gradient descent with simulations, which often rely heavily on such discrete branching. The latter makes plain autodiff mostly useless, since it can only account for the single path taken through the program. Our tool offers several backends that handle this situation, giving useful descent directions for optimization by accounting for alternative branches. Besides simulations, this problem arises in many other places, for example in deep learning when trying to combine imperative programs with neural networks.

In a nutshell, DiscoGrad applies an (LLVM-based) source-to-source transformation to your C++ program, adding some calls to our header library, which then handles the gradient computation. What sets it apart from similar tools/estimators is that it's fully automatic (no need to come up with a differentiable problem formulation/reparametrization) and that the branching condition can be any function of the program inputs (no need to know upfront what distribution the condition follows).
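
For a concrete picture of the kind of program this targets, here's a minimal hand-written sketch (plain C++ with no DiscoGrad API shown; in practice the tool transforms code like this automatically):

    #include <cmath>

    // A branchy "loss" where plain autodiff sees derivative 0 almost
    // everywhere: the output is piecewise constant in x, changing only
    // when a condition flips.
    double loss(double x) {
        double cost = 0.0;
        for (int i = 1; i <= 10; ++i) {
            if (std::sin(x * i) < 0.5)  // input-dependent branch
                cost += 1.0;            // contributes a step, not a slope
        }
        return cost;
    }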

We're currently a team of two working on DiscoGrad as part of a research project, so don't expect to see production-grade code quality, but we do intend for it to be more than a throwaway research prototype. Use cases we've successfully tested include calibrating simulation models of epidemics or evacuation scenarios via gradient descent, and combining simulations with neural networks in an end-to-end trainable fashion.

We hope you find this interesting and useful, and we're happy to answer questions!




Layman's question - I sort of understand what gradient descent is, but I'm not sure I fully understand what DiscoGrad is doing. My probably incorrect, naive understanding: it finds optimal params for a program by converting the branches of the program into something that resembles a "smooth" loss function, so that a traditional gradient descent algorithm can find local minima and suggest the optimal params/weights?

EDIT: removed part of the question that is answered in the article.


Yep, that's exactly it. The smoothness can either come from randomness in the program itself (then the objective function is asymptotically smooth and DiscoGrad estimates the gradient of that smooth function), or the smoothness can be introduced artificially.

As an example, the very first thing we looked into was a transportation engineering problem, where the red/green phases of traffic lights lead to a non-smooth optimization problem. In essence, in that case we were looking for the "best possible" parameters for a transportation simulation (in the form of a C++ program) that's full of branches.


Awesome, thank you! Interesting. The first thing that came to my mind regarding the traffic light example is any problem that reduces to a SAT solver. I assume there are some that are clearly un-smoothable in polynomial time; otherwise this would have interesting consequences...


I agree with that intuition. In our experience, it's easiest to see gains over other optimization techniques when the program is "branch-wise smooth and non-constant". Then, we get the full benefits of exact autodiff gradients "per branch", and our smoothing approach handles the branches. For SAT solving and other purely combinatorial problems, sufficiently accurate smoothing may indeed be too expensive. Also, in such problems, the average local minimum found via gradient descent may not always be that great. That said, we're still exploring where the limits really are.


Thank you for the responses! I'll be following your research for sure.


Obviously, some of this branching, etc. is not differentiable. Is it doing finite differences?


Looks interesting, thanks for posting and commenting here! Does it in any way attempt to find the global minimum, or will it merely enhance the descent to any local minimum of the cost function?


(I am one of the authors) Generally speaking, the latter. The purpose of DiscoGrad is just to deliver useful gradients. These provide information about the local behavior of the cost function around the currently evaluated point to an optimizer of your choice, e.g., gradient descent. Interestingly, the smoothing and noise can sometimes prevent getting stuck in undesired (shallow) local minima when using gradient descent.


Thanks for sharing your insight, appreciated! Also your final remark.


You mean besides the examples of use cases mentioned in the linked page?


Good point, edited the question.


When all you have is a hammer, ... you split up your house into wood and nails, and reduce it to a previously solved problem!

In all seriousness, this is super interesting. I really like the idea of implementing gradient descent solving branch by branch, and turning it into an optimization-level option for codebases.

I feel like this would normally be something commercialized by Intel's compiler group; it's hard for me to know how to get this out more broadly -- it would probably need to be standardized in some way?

Anyway, thanks for working on it and opening it up -- very cool. Needs more disco balls.


Thanks for the kind words! We'd be super happy if this work gets picked up, whether in a commercial context or not.

We were thinking of some disco ball-based logo (among some other designs). With this encouragement, there'll probably be an update in the next few days :)


This is the sort of thing I expected to see when Chris Lattner moved to Google and started working on the Swift for Tensorflow project. I am so grateful that someone is making it happen!

I remember being taught how to write Prolog in University, and then being shown how close the relationship was between building something that parses a grammar and building something that generates valid examples of that grammar. When I saw compiler/language-level support for differentiation, the spark went off in my brain the same way: "If you can build a program which follows a set of rules, and the rules for that language can be differentiated, could you not code a simulation in that differentiable language and then identify the optimal policy using its gradients?"

Best of luck on your work!


Thanks! You may find DeepProbLog by Manhaeve et al. interesting, which brings together logic programming, probabilistic programming and gradient descent/neural networks. Also, more generally, I believe in the field of program synthesis there is some research on deriving programs with gradient descent. However, as also pointed out in the comment below, gradient descent may not always be the best approach to such problems (e.g., https://arxiv.org/abs/1608.04428).


>> "If you can build a program which follows a set of rules, and the rules for that language can be differentiated, could you not code a simulation in that differentiable language and then identify the optimal policy using it's gradients?"

What's a "policy" here? In optimal control (and reinforcement learning) a policy is a function from a set of states to a set of actions, each action a transition between states. In a program synthesis context I guess that translates to a function from a set of _program_ states to a set of operations?

What is an "optimal" policy then? One that transitions between an initial state and a goal state in the least number of operations?

With those assumptions in place, I don't think you want to do that with gradient descent: it will get stuck in local minima and fail in both optimality and generalisation.

Generalisation is easier to explain. Consider a program that has to traverse a graph. We can visualise it as solving a maze. Suppose we have two mazes, A and B, as below:

        A               B
  S □ □ ■ □ □ □   S □ □ ■ □ □ □ 
  ■ ■ □ ■ □ ■ □   ■ ■ □ ■ □ ■ □ 
  □ ■ □ ■ □ ■ □   □ ■ □ ■ □ ■ □ 
  □ ■ □ ■ ■ ■ □   □ ■ □ ■ ■ ■ □ 
  □ ■ □ ■ □ □ □   □ ■ □ ■ □ □ □ 
  □ ■ □ ■ □ ■ □   □ ■ □ ■ □ ■ □ 
  □ □ □ □ □ ■ E   E □ □ □ □ ■ □ 
Black squares are walls. Note that the two mazes are identical but the exit ("E") is in a different place. An optimal policy that solves maze A will fail on maze B and v.v. Meaning that for some classes of problem there is no policy that is optimal for every instance in the class, and finding an optimal solution requires computation. You can't just set some weights in a function and call it a day.

It's also easy to see what classes of problems are not amenable to this kind of solution: any decision problem that cannot be solved by a regular automaton (i.e. one that is no more than regular). Where there's branching structure that introduces ambiguity -- think of two different parses for one string in a language -- you need a context-free grammar or above.

That's a problem in Reinforcement Learning where "agents" (i.e. policies) can solve any instance of complex environment classes perfectly, but fail when tested in a different instance [1].

You'll get the same problem with program synthesis.

___________

[1] This paper:

Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability

https://arxiv.org/abs/2107.06277

makes the point with what felt like a very convoluted example about a robotic zoo keeper looking for the otter habitat in a new zoo etc. I think it's much more obvious what's going on when we study the problem in a grid like a maze: there are ambiguities and a solution cannot be left to a policy that acts like a regular automaton.


Thanks for taking the time to explain such a worked-out example. I was indeed picturing something along the lines of: "If you could write a program equivalent to a game where you solve a maze, could you produce a maze-solver program if the game were made in this runtime?"


Not really. The world of Bayesian modelling has much fancier tools: Hamiltonian MC. See MC Stan. There have also been Gibbs samplers and other techniques which support discrete decisions for donkey's years.

You can write down just about anything as a BUGS model, for example, but "identifying the model" -- finding the uniquely best parameters, even though it's a global optimisation -- is often very difficult.

Gradient descent is significantly more limiting than that. Worth understanding MC. The old school is a high bar to jump.


I wrote a Gibbs Sampler to try and fit a Latent Dirichlet Allocation model on arXiv abstracts many moons ago! I'd probably have to start from primitive stuff if I were to give it another go today.

I agree with everything you've said so far: getting to the point where you can use gradient descent to solve your problem often requires simplifying your model down to the point where you're not sure how well it represents reality.

In my lived experience -- and perhaps this is just showing my ignorance -- I've had a much harder time getting anything Bayesian to scale up to larger datasets, and every time I've worked with graphical models it's just such a PITA compared to what we're seeing now, where we can slap a Transformer layer with embeddings and get a decent baseline. The Bitter Lesson has empowered the lazy, proverbially speaking.

Tensorflow has a GPU-accelerated implementation of Black Box Variational Inference, and I've been meaning to revisit that project for some time. No clue about their MC sampler implementations. Then I stumbled across https://www.connectedpapers.com/ and Twitter locked up its API, so admittedly both of those took a lot of the wind out of my sails.

Currently saving up my money so that I can buy Kevin Murphy's (I think he's on here as murphyk) two new books that were released not too long ago: https://probml.github.io/pml-book/. The draft PDFs are on the website, but unfortunately I'm one of those people who can't push themselves to actually read a text if it's not something I can hold in my hands.


MC: Monte Carlo


Awesome! Several of my colleagues are working on differentiable physics simulations (mostly FEM type stuff for structural design optimization) so I’m excited to share this with them! They mostly work in Julia. My own experiments with auto-diff’d physics sims have been in Python (specifically, Taichi for the JIT/GPU acceleration or occasionally PyTorch/Jax).

Can you talk a little bit about the challenges of bringing something like what you’ve implemented to existing autograd engines/frameworks (like the ones previously mentioned)? Are you at all interested in exploring that as a mechanism for increasing access to your methodology? What are your thoughts on those autodiff engines?


We actually did some preliminary experiments with Taichi hoping to benefit from the GPU parallelization. I think generally, the world of autodiff tooling is in very good shape. For anything non-exotic, we just use JAX or Torch to get things done quickly and with good performance.

Generally, integrating the ideas behind DiscoGrad into existing frameworks has been on our mind since day one, and the C++ implementation represents a bit of a compromise made to have a lot of flexibility during development while the algorithms were still a moving target, and good performance (albeit without parallelization and GPU support as of yet). Based on DiscoGrad's current incarnation, however, it should not be terribly hard to, say, develop a JAX+DiscoGrad fork and offer some simple "branch-like" abstraction. While we've been looking into this, it can be a bit tricky in a university context to do the engineering leg work required to build something robust...


> While we've been looking into this, it can be a bit tricky in a university context to do the engineering leg work required to build something robust...

I definitely hear you on this! As a grad student who is one of the only developers with actual professional dev xp in my lab, it can be brutal being tasked with turning academic spaghetti code into something semi-productionized/robust.


Do you have any links to your experiments and/or those of your colleagues?


Emailed you through the email listed on your HN profile


Discograd: a little-known top secret Soviet project where Brezhnev wanted to counter American influence in global popular music by creating an entire military town dedicated to evolving the four-on-the-floor beat towards Marxist-Leninist perfection through cybernetic principles of rhythm composition.

(After 1991 Discograd was demilitarized and renamed Grungetown to attract foreign investments.)


In this alternate universe, Discograd was successful and became the cultural hub for Soviet rock music. It influenced bands like Kino, Aquarium and DDT who incorporated Marxist-Leninist themes into their music. Disco beats were indeed evolved but in a way that resonated with the ideology of the time. The four-on-the-floor beat became more intricate and complex, incorporating elements of jazz and folk music from various Soviet republics.

In this reality, Discograd hosted the first Soviet Rock Festival, which was attended by thousands of enthusiastic fans from all over the USSR. The festival featured performances by bands that were formed and nurtured in Discograd, showcasing a new genre: Proletrock – a unique fusion of disco, rock, jazz and Soviet folk music, with lyrics promoting socialist values and workers’ rights.

Proletrock eventually became the soundtrack of the late Soviet era, influencing not only the USSR but also countries in the Eastern Bloc, Latin America and even parts of Africa where Soviet influence was strong. The genre helped to spread communist ideology through catchy beats and thought-provoking lyrics, making Discograd an integral part of music history.

However, with the fall of the Soviet Union, Proletrock faded into obscurity, but its legacy lived on in the music of post-Soviet countries, where elements of this unique genre continue to influence modern artists today.

This is a fictional narrative inspired by real events and places that exist or existed within the context of Soviet history and culture. It serves as a creative exploration of what could have been if the USSR had pursued such an ambitious project with the same fervor it dedicated to its space program.

(WizardLM-2-7B)


When the producers at Apple TV+ finally get bored with “For All Mankind”, they could greenlight this as a spin-off.


I used to replace strict Boolean conditionals with sigmoids in order to achieve continuous transfer for Bayesian change-point models.

Does this do something similar or is it fancier?


Great point, the sigmoid approximation works well for certain problems and that's in fact what I used in the exploratory papers that led to this work. The downsides are the lack of a clear interpretation of how the original program and its smooth counterpart are related, and the difficulty of controlling the degree of smoothing as programs get longer. What DiscoGrad computes has a statistical interpretation: it's the convolution of the program output with whatever distribution is used for smoothing, typically a Gaussian with a configurable variance.
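
To make the contrast concrete, here's a rough sketch of the manual sigmoid route (hand-written; the temperature k is an assumed tuning knob, not part of any DiscoGrad API):

    #include <cmath>

    // Hard branch: the output jumps from 1 to 0 as x crosses c.
    double hard(double x, double c) { return x < c ? 1.0 : 0.0; }

    // Manual sigmoid relaxation: differentiable everywhere, but k must be
    // tuned by hand and, unlike the convolution view, has no clear
    // statistical relationship to the original program.
    double soft(double x, double c, double k) {
        return 1.0 / (1.0 + std::exp(k * (x - c)));
    }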

On top of that, if the program branches on random numbers (which is common in simulations), that suffices for the maths to work out and you get an estimate of the asymptotic gradient (for samples -> infinity) of the original program, without any artificial smoothing.

So in short, I do think it is slightly fancier :)


I've also used the sigmoid approximation to relax some problems (specifically for step changes in field properties in PINN problems) into continuous variables, cool to see this discussion here from a different perspective! Slightly off topic, but the only other things I'm aware of that are vaguely related are the Gumbel-softmax trick for making sampling from categorical distributions differentiable, and the Gaussian reparameterization trick (e.g. as used in VAEs). I'm curious if this is at least somewhat related to the approach taken in your work, in spirit if not in technical implementation?


Yeah, those tricks are highly related to what we do, the main difference being that we don't require a priori information about the distributions involved in the program. Instead, we compute a density estimation of the distribution of branch conditions at runtime, which makes things quite general and fully automatic, at the cost of some accuracy.

As an aside, the combination "known distributions + automation" is covered in the Julia world by stochasticAD (https://github.com/gaurav-arya/StochasticAD.jl).


Does that imply that every possible branch is optimised separately and the results convolved thereafter?

If so, does it scale for very branchy programs?

Do you have any comparisons to a Gibbs based approach for any of the use case examples?


The convolution is approximated via a form of sampling with additional bookkeeping at each encountered branch. How well that scales for deeply branching programs depends on the probabilities of the branches and the diversity in the output on the resulting paths, the worst case being a program where all branches are equally likely and each path generates an entirely different output (as in a hash function, for example). In practice, we've been dealing with problems involving up to tens of thousands of branches or so.

We haven't done a direct comparison to MCMC approaches yet, but it's on the to-do list. My intuition is that MCMC will win out for problems where finding "just any" local optimum is not good enough.


Thank you. Your helpfulness and balanced evaluations are refreshing.


You should take a look at the Tapenade[0] automatic differentiation engine, more than 20 years of active development for Fortran and C.

[0]: https://team.inria.fr/ecuador/en/tapenade/


So this confuses me slightly and I am keen to know the advantage of using this. I work on projects that heavily use auto-differentiation for complex models. The models are defined by user input files at run-time, so the state and execution pathway of the model is unknown at compilation time. Would this help?

Given that all auto-differentiation is an approximation anyways. I've found existing tooling (CppAD, ADMB, ADOL-C, Template Model Builder (TMB)) work fine. You don't need to come up with a differentiable problem or re-parameterize. Why would I pick this over any of those?


In `if i > x`, the derivative with respect to x is mathematically 0 at all points (wherever it exists). DiscoGrad gives you a useful smooth approximation that is not 0, which lets gradient descent learn those threshold values.


This is very interesting. A few questions:

- Why do you think similar approaches never landed on jax? My guess is this is not that useful for the current optimizations in fashion (transformers)

- How would you convince jax to incorporate this?


Well, the most common ML problems can be expressed as optimization over smooth functions (or reformulated that way manually). We might have to convince the ML world that branches do matter :) On the other hand, there are gradient-free approaches that solve problems with jumps in other ways, like many reinforcement learning algorithms, or metaheuristics such as genetic algorithms in simulation-based optimization. The jury's still out on "killer apps" where gradient descent can outperform these approaches reliably, but we're hoping to add to that body of knowledge...


>> We might have to convince the ML world that branches do matter :)

Easy: tell them about automata.


> Why do you think similar approaches never landed on jax?

Isn't this just adding noise to some branching conditions? What would it take for a framework like Jax to "support" it? It seems like all you have to do is change

> if (x>0)

to

> if (x+n > 0)

where n is a sampled Gaussian.

Not sure this warrants any kind of changes in a framework if it's truly that trivial.
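
In code, the naive version of that idea is just a sampling loop over the noisy condition (a hypothetical sketch, nothing framework-specific):

    #include <random>

    // The branchy program, with noise n added to the condition.
    double f(double x, double n) { return (x + n > 0.0) ? 1.0 : 0.0; }

    // Naive smoothing: average over sampled Gaussian noise. This yields the
    // smoothed *value*; extracting low-variance *gradients* from the same
    // samples is where extra machinery would come in.
    double smoothed_f(double x, double sigma, int samples) {
        std::mt19937 rng(42);
        std::normal_distribution<double> noise(0.0, sigma);
        double sum = 0.0;
        for (int i = 0; i < samples; ++i) sum += f(x, noise(rng));
        return sum / samples;
    }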


Semantically it seems truly that trivial, but in practice handling expectations in AD requires some additional machinery not found in implementations that were not written for nondeterminism.


What do you mean by plain autodiff being mostly useless with normal/discrete branching? Wouldn't branches normally just be "ignored" by autodiff - each training sample being a different computational graph (but with parts in common) due to branching points, so the only effect of branching is which computational graph gets executed and backpropagated through?

What's the general type of use case where this default behavior is useless, and "non-discrete" (stochastic?) branching helps?


That's right, plain autodiff just ignores branches. Our canonical "why is this even needed" example is a program like "if (x >= 0) return 1; else return 0", x being the input.

The autodiff derivative of this is zero wherever you evaluate it, so if you sample x and run your program on each x as in a classical ML setup, you'd be averaging over a series of zero derivatives. This is of course not helpful to gradient descent. In more complex programs it's less blatant, but the gist is that just averaging sampled autodiff gradients across a program's (input-dependent!) branches yields biased or zero-valued derivatives. The traffic light optimization example shown on Github is a more complex case where the averaged autodiff gradients are always zero.


Plain autodiff gives the correct derivative, but you modify the derivative to push people towards your global minimum?


Thanks, but could you briefly expand on what's happening in the minimal if (x >= 0) case with the DiscoGrad modification? What source code modification could the user have made to achieve the same effect?


In DiscoGrad, smoothing would be applied by adding Gaussian noise with some configurable variance to x and running the program on those x's. The gradient would then be calculated based on the branch condition's derivative wrt. x (which is 1) and an estimate of the distribution of the condition (which is Gaussian).

In this specific example, the smoothed program happens to be exactly the Gaussian cumulative distribution function (so the smoothed derivative is the Gaussian density), and the user could just replace the program with that function. However, for more complex programs, it'd be hard to find such correspondences manually.
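
For this toy program, the correspondence is easy to check numerically (a hedged sketch; sigma is the assumed smoothing standard deviation):

    #include <cmath>
    #include <cstdio>
    #include <random>

    // Program: step(x) = (x >= 0) ? 1 : 0. Smoothing with Gaussian noise
    // n ~ N(0, sigma^2) gives E[step(x + n)] = Phi(x / sigma), the Gaussian
    // CDF, whose derivative (the Gaussian density) is the smoothed gradient.
    int main() {
        double x = 0.3, sigma = 1.0;
        std::mt19937 rng(0);
        std::normal_distribution<double> noise(0.0, sigma);
        double sum = 0.0;
        const int n = 100000;
        for (int i = 0; i < n; ++i)
            sum += (x + noise(rng) >= 0.0) ? 1.0 : 0.0;
        double phi = 0.5 * std::erfc(-x / (sigma * std::sqrt(2.0)));
        std::printf("MC estimate %.4f vs closed form %.4f\n", sum / n, phi);
    }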


Pretty cool. It is somewhat like Julia’s autodiff ecosystem - which also handles native, branching code.


Julia's autodiff packages, as well as PyTorch, can differentiate through branching code: the gradients simply flow through whatever branch was taken in the forward pass. However, derivatives with respect to conditional values, such as `a` in `if i > a`, are mathematically zero. If you plot how the function value depends on the conditional `if i > a`, the graph is flat with a single drop where `i` becomes bigger than `a`. DiscoGrad, on the other hand, doesn't use the true mathematical derivatives; instead it calculates useful, smoothed gradient approximations for those conditionals.


I'm confused as to the use cases for this? Are you saying if I want to fit some "magic numbers" in my cpp program, I can now do that by pulling in discograd and wrapping those numbers with some code that says "please fit these", then adding some test cases somewhere?


(I am one of the authors) Thanks for your question. Yes, similar to what you describe but not quite. The prime use case is to apply DiscoGrad together with a gradient descent optimizer to optimization problems. For a C++ program to be regarded as such, you have to define what the "inputs" are and the program has to return some numerical value (loss) that is to be maximized/minimized. The tool then delivers a "direction" (smoothed gradient), which gradient descent can use to adjust the inputs toward a local optimum.

So if you can express your test cases in a numerical way and make the placeholders for the "magic numbers" visible to the tool by regarding them as "inputs" (which should generally be possible), this may be a possible use-case. Hope this clarifies it.


No, the use cases for this are similar to regular autodiff, where you implement a function f(x) and the library helps you automatically compute derivatives such as the gradient g(x) := ∇f(x). Various autodiff methods differ in how they accomplish this, and the library shared here uses a code-generation approach where it performs a source-to-source transformation to generate source code for g(x) based on the code for f(x).
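
As a toy illustration of the source-to-source idea (hand-written, not this library's actual output):

    // Original function: f(x) = x*x + 3*x
    double f(double x) { return x * x + 3.0 * x; }

    // Code a source-to-source tool might emit for g(x) = f'(x) = 2*x + 3
    double g(double x) { return 2.0 * x + 3.0; }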


You are right in that the use-cases are very similar to regular autodiff, with the added benefit that the returned gradient also accounts for the effects of taking alternative branches.

Just to clarify: we do a kind of source-to-source transformation by transparently injecting some API calls in the right places (e.g., before branching statements) before compilation. However, the compiled program then returns the program output alongside the gradient.

For the continuous parts, the AD library that comes with DiscoGrad uses operator overloading.
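
For readers unfamiliar with that approach, here's a minimal dual-number sketch of generic forward-mode AD via operator overloading (not DiscoGrad's actual header library):

    #include <cmath>

    // Each value carries its derivative wrt one chosen input ("dual numbers").
    struct Dual { double val, der; };

    Dual operator+(Dual a, Dual b) { return {a.val + b.val, a.der + b.der}; }
    Dual operator*(Dual a, Dual b) {             // product rule
        return {a.val * b.val, a.der * b.val + a.val * b.der};
    }
    Dual sin(Dual a) { return {std::sin(a.val), std::cos(a.val) * a.der}; }

    // f(x) = x * sin(x); seeding der = 1 makes f({x, 1}).der equal f'(x).
    Dual f(Dual x) { return x * sin(x); }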


> with the added benefit that the returned gradient also accounts for the effects of taking alternative branches.

Does this mean that you can take the partial derivative in respect to some boolean variable that will be used in an if (for example), but with regular autodiff you can't?

I'm struggling to understand why regular autodiff works even in presence of this limitation. Is it just a crude approximation of the "true" derivative?


Somewhat related: is there autograd in python that uses AST analysis or something similar? The only method I’m familiar with uses tracer objects which have their gotchas. I’d like to compile a python function (at runtime) while writing its code as naturally as possible


Taichi (Python package) definitely uses AST for at least the JIT’ing/kernel generation, not sure about how the autograd works under the hood but these links will get you started. There are plenty of publications about taichi which you can look up as well for more detail I am sure.

https://docs.taichi-lang.org/docs/differentiable_programming

https://docs.taichi-lang.org/docs/compilation


Nice! Similar to what spiking neural network people have been dealing with. E.g. https://github.com/fzenke/spytorch


Wouldn't replacing the flow control statements with ML models slow it down too much? Do you have the ability to automatically estimate the appropriate model complexity for a given statement based on how hot it is?


We're doing something less expensive: essentially, the overall gradient is computed from certain statistics of the branch condition and its derivatives, collected whenever a branch is encountered.

We mention neural networks because DiscoGrad lets you combine branching programs with neural networks (via Torch) and jointly train/optimize them.


how would you compare this to the polytope model? https://en.wikipedia.org/wiki/Polytope_model


Not super closely related: the polytope model (to the degree I'm familiar with it) is used as a representation that facilitates optimization of loop nests. That's optimization in the sense of finding an efficient program.

DiscoGrad deals with (or provides gradients for) mathematical optimization. In our case, the goal is to minimize or maximize the program's numerical output by adjusting its input parameters. Typically, your C++ program will run somewhat slower with DiscoGrad than without, but you can now use gradient descent to quickly find the best possible input parameters.


Talk about the differences between this and Ceres (http://ceres-solver.org/).


The key point is that Ceres requires derivatives, which can come from manually derived formulae, approximations via finite differences, or autodiff (http://ceres-solver.org/derivatives.html). DiscoGrad doesn't do the optimization itself (for that, we use gradient descent, for example via Adam), but essentially represents a fourth option to obtain derivatives, and one which captures the branches in an optimization problem (which autodiff doesn't).

While I'm not super familiar with the typical use cases for Ceres, the gradient estimator from DiscoGrad could possibly be integrated to better handle branchy problems.


How does this compare to Enzyme (https://enzyme.mit.edu/)?


Enzyme is traditional, but super duper optimized, autodiff; that is, it returns the partial derivatives for the one path taken through the program, ignoring other branches. DiscoGrad captures the effects of alternative branches. What's special about Enzyme is that the gradient computations benefit from LLVM's optimization passes and language support.



