Magicoder: Source Code Is All You Need (arxiv.org)
183 points by tosh on Dec 5, 2023 | 67 comments



For anyone looking to try this out on consumer-grade hardware, here's a q4 version [0]. From initial testing it's subjectively a bit behind deepseek-coder-instruct [1] at the same size and quantization. Deepseek's model is near magic when it behaves (1-2 tries), spitting out nicely formatted code fence blocks in markdown that I can naively render in real time to get a local ChatGPT-like experience. Magicoder can do this too, but it usually takes 3-4 tries and it tends to "opt out" and ask for more info pretty frequently. Of course, if you have a ton of VRAM, use the 33b-instruct variant instead.

After more testing, I think it's a toss-up on most coding tasks, but Magicoder tends to give subjectively better responses to "bad prompts". That is, prompts where you don't put effort into writing clear instructions. For example, one of my "bad prompt" tests is

> how to enable shared gpu memory in wsl2 docker container

A good response to this would discuss the nvidia container toolkit, maybe something about port forwarding, etc. But this isn't a prompt most models can give good responses to. Both of these models can handle it, even at 7b, but Magicoder gives more information.

[0] https://huggingface.co/LoneStriker/Magicoder-S-DS-6.7B-4.0bp...

[1] https://huggingface.co/bartowski/deepseek-coder-6.7b-instruc...


In my experience with llama.cpp, min-p sampling gives far better results than the defaults. For example (note this is deepseek-coder 33b):

Prompt: "Write the fibonacci function in Python3"

min_p 0.05:

    def fibonacci(n:int, memo={}) -> int:
        if n <=2 : return 1 # base case 1, 2 => 1 
    
        if n not in memo:  
            memo[n] = fibonacci(n-1, memo) + fibonacci(n-2,memo) # store the result
        
        return memo[n]
    # Test cases 
    print("Test case 1")
    print(f"The 5th Fibonacci is {fibonacci(5)} ")  
    print(f"The 9th Fibonacci is {fibonacci(9)} ")
Default sampling settings:

    def fibonacci(n):
      if n == 0:
          return 0
      elif n ==1:
          return 1
      else :
          return (fibonacci(n - 1) + fibonacci(n -2))
    n = int(input())
    print(fibonacci(n))
Seed is the same.

So if you're using these models locally, make sure you're using min-p sampling. All other samplers are genuinely very suboptimal.
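
In recent llama.cpp builds this is just the --min-p flag. If you drive it from Python through the llama-cpp-python bindings instead, a minimal sketch looks roughly like this (assuming a build new enough to expose min_p; the model path is only an example):

    from llama_cpp import Llama

    # example path to a local GGUF quant; point this at whatever model you use
    llm = Llama(model_path="deepseek-coder-6.7b-instruct.Q4_K_M.gguf")

    out = llm(
        "Write the fibonacci function in Python3",
        max_tokens=512,
        temperature=0.7,
        min_p=0.05,  # drop tokens below 5% of the top token's probability
        top_p=1.0,   # effectively disable top-p so min-p does the filtering
        top_k=0,     # disable top-k as well
    )
    print(out["choices"][0]["text"])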

For more in-depth info about _why_: https://www.reddit.com/r/LocalLLaMA/comments/17vonjo/your_se...


> spitting out nicely formatted code fence blocks in markdown that I can naively render in real time to get a local chatgpt like experience.

Not sure if I understood correctly, but in my experience almost every 7B instruct model does this if you add something like "respond with markdown" to the system prompt.

Chatbot-UI (a ChatGPT UI clone) handles markdown nicely and does code rendering in real time.
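
If you're rolling your own UI, the real-time rendering part is pretty easy to hack together too. A rough sketch with the rich library, where the token stream is just a stand-in for whatever your local server or bindings yield:

    from rich.live import Live
    from rich.markdown import Markdown

    def stream_tokens():
        # stand-in generator; replace with your model's streaming output
        yield from ["Here's the function:\n\n", "```python\n",
                    "def add(a, b):\n", "    return a + b\n", "```\n"]

    buffer = ""
    with Live(refresh_per_second=10) as live:
        for tok in stream_tokens():
            buffer += tok
            live.update(Markdown(buffer))  # re-render the accumulated markdown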


> tends to give subjectively better responses to "bad prompts".

I wonder if a first pass with another model to expand these so-called bad prompts into better prompts would work.
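
Something like this quick two-pass sketch, maybe: a general chat model rewrites the lazy prompt before the coder model sees it (both model paths are just hypothetical examples):

    from llama_cpp import Llama

    # hypothetical choices: any general chat model plus any coder model
    rewriter = Llama(model_path="openchat-3.5.Q4_K_M.gguf")
    coder = Llama(model_path="magicoder-s-ds-6.7b.Q4_K_M.gguf")

    bad_prompt = "how to enable shared gpu memory in wsl2 docker container"

    better_prompt = rewriter(
        "Rewrite the following into a clear, detailed technical question:\n"
        + bad_prompt + "\n",
        max_tokens=128,
    )["choices"][0]["text"]

    answer = coder(better_prompt, max_tokens=1024)["choices"][0]["text"]
    print(answer)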


How worth it is it, though? From my understanding it is trained on GPT-3 code snippet outputs; is the Q4 model any good?


Any time I see HumanEval comparisons I feel the need to point out that HumanEval is only 164 questions. 66.5% vs. 65.9% is the difference between 109 and 108 solutions, a single question. Still interesting work, though.


At least they used HumanEval+, which adds a bunch more test cases and fixes some errors in the original benchmark!


Sentiment analysis of movie reviews is not what I would call an inspired or high-quality programming problem. This seems like overly common examples from the training data being made manifest.

There are so many more interesting things you could be doing. Even if you stick with "use tf-idf on review data", how about sentiment analysis of vacation destination reviews segmented by the season they were posted? Things like that lead directly into other ideas and possible metrics.

Creativity is really a sore spot with these. I suspect more elaborate prompts can suppress the commonalities, but GPT-3.5 with a bog-standard prompt gives bog-standard ideas.


I think these smaller models really struggle with the reasoning aspect of writing decent code. I'm getting pretty nonsensical things out when asking it to fix fizzbuzz (though it made the right fix), like:

  - Replaced "i % 2 === 0" with "i % 3 === 0", because a number is divisible by 2 only if it's divisible by 3.


The model might have been trained with code that contained this exact comment


I'm an author of Magicoder. I really appreciate every discussion (both positive and negative)! We release Magicoder with the hope that everyone can reproduce and improve it, by openly sharing all the data and every code detail, which none of the leading code models do. That's the main reason we claim it to be fully open-source. Magicoder isn't perfect yet, but we are committed to constantly overcoming these challenges. Through complete transparency, we believe an increasing number of advanced code models will come out soon.


A couple of tests I've done in their live demo:

- Implement the quicksort algorithm in Python. I like this test because LLMs usually do it the first time using list comprehensions, then I ask them to avoid allocations.

- Write the game "snake" in Python, and then in JS/HTML.

It succeeded on the simple quicksort implementation with list comprehensions, but failed when I asked for the in-place implementation. With the snake game tests, it didn't even get close.
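
For reference, the list-comprehension version these models tend to produce on the first try looks something like this; it allocates new lists at every level of recursion, which is why I follow up asking for an in-place version:

    def quicksort(arr):
        # not in-place: builds fresh lists on every recursive call
        if len(arr) <= 1:
            return arr
        pivot = arr[0]
        less = [x for x in arr[1:] if x <= pivot]
        greater = [x for x in arr[1:] if x > pivot]
        return quicksort(less) + [pivot] + quicksort(greater)

    print(quicksort([10, 7, 8, 9, 1, 5]))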

In my experience these results are worse than non-code top 7B LLMs such as OpenChat/NeuralHermes, and significantly worse than plain Deepseek-coder-instruct 6.7B (which is the only small model I've tested that gets the "snake" game to work). When I ran these tests locally I used Q6 quantization (GGUF), and I assume their live demo is not quantized.

I find the benchmark results surprising, to the point I wonder if the benchmarks didn't leak into the training set.


Thanks for testing Magicoder! I tried "Implement an in place quicksort" with temperature 0 using the demo, and here is the output I got. I haven't checked every detail, but it looks correct:

Here is a Python implementation of an in-place quicksort:

  def partition(arr, low, high):
      i = (low-1)
      pivot = arr[high]
  
      for j in range(low, high):
          if arr[j] <= pivot:
              i = i+1
              arr[i], arr[j] = arr[j], arr[i]
  
      arr[i+1], arr[high] = arr[high], arr[i+1]
      return (i+1)

  def quickSort(arr, low, high):
      if len(arr) == 1:
          return arr
      if low < high:
          pi = partition(arr, low, high)
          quickSort(arr, low, pi-1)
          quickSort(arr, pi+1, high)
  
  arr = [10, 7, 8, 9, 1, 5]
  n = len(arr)
  quickSort(arr, 0, n-1)
  print("Sorted array is:", arr)
This code first defines a partition function that takes an array and two indices, low and high, and rearranges the elements of the array so that all elements less than the pivot (the last element in the array) come before it, and all elements greater than the pivot come after it.

Then it defines a quickSort function that takes an array and two indices, low and high, and sorts the array in-place using the partition function. If the array has more than one element, it recursively sorts the subarrays to the left and right of the pivot.

Finally, it tests the function with an array of integers.


I used the prompt "Write snake game." on the live demo with temperature = 0 and max tokens = 2048. I got the correct implementation of the snake game in Python. But even after a few tries, I could not generate a snake game in HTML/JS.



I don't think the Open Source claim is accurate. From their repo: "Magicoder models are trained on the synthetic data generated by gpt-3.5-turbo-1106 developed by OpenAI. Please pay attention to OpenAI's terms of use when using the models and the datasets."


OpenAI behaves as though everyone else's outputs are fair game for training LLMs, including GPT, but doesn't want others using GPT's output?

So, when OpenAI does it, it's transformative, but when we do it, it's not?

That's not right, and I don't think the courts will rule in their favor.


> That's not right, and I don't think the courts will rule in their favor.

Especially if you consider that all of OpenAI's code training data comes from open GitHub repositories.


Yes, I totally agree. They are more than happy to make money off the backs of copyright holders.


> Please pay attention to OpenAI's terms

Just because people at a company tell you how to behave doesn't mean you need to comply


Or you can do the right thing and avoid distilling that stinky model, like they’ve asked.


Does the fact someone asks you not to do something inherently make it the right thing? Especially when what they've asked is hypocritical?


What makes it the "right thing" and who decided it was right?


All model outputs (text, image, video) are considered public domain in the US, so they can be used for whatever.


That's not true. Compiler output is not public domain, so why should LLM output be different? Both ingest some code and output some code as a result.


Compiler output isn't owned by the compiler writer, but usually by the compiler user.


Not sure why you brought up compiler output; I'm talking about generative machine learning models.


Asked it to write code for two python programs (one of them tricky) and it got both of them right in the first pass. Looks promising.


I'm guessing this model was made to be small simply to keep costs low and make sure that they could feasibly train the model with the amount of time/effort they had. But to some extent I'm left wondering whether this technique would continue to be fruitful when scaled up to a huge model and with a bigger initial training set.


It was made to be small out of necessity. The US government put extensive export controls on many inter-GPU connectivity products last year and expanded those controls recently to include anything above an A100.

Page 9 of this recently published paper[1] is a strong indicator of how far non-US firms go to formally analyze and factor in these bandwidth constraints in building large models.

[1] https://arxiv.org/pdf/2311.15786v2.pdf


Ah, that makes sense. Is it possible for those researchers to just rent cloud compute, or is that also prohibited? I suppose that the obvious thought in my mind would be to find some cheap cloud GPU provider and use their platform to do the training. But maybe they're more concerned about inference afterwards, and so that doesn't really solve their issue.


could have been a blog post

this phase reminds me of when crypto projects all had pseudo-academic "white papers" in order to be taken "seriously"


nice username for that comment.

i was toying around quite a bit with the ecosystem a couple years ago. particularly, comparing all the different sidechains and "layer 2" networks: those white papers were a godsend, because i could actually understand the _specific_ guarantees and assumptions each one made (and it made spotting the BS ones trivial). it's like `man` for the internet, and i quite like that.

i don't see the parallel between this PDF and cryptocurrency whitepapers.


Somebody should make a coomer meme but with AI “all you need” papers


All you need is all you need


Seriously! It was clever in the beginning but it's wearing on me seeing it everywhere now.


All you need considered harmful.


That'll take it full circle.


Just make one that's capable of creating the perfect React/Vue/Angular framework, including upgrade paths, and, while at it, have it use that framework for us so that we don't have to bother with reinventing the wheel every 3 years.


"the solution to runaway complexity is more complexity"

okay, more nuanced than that. but from the perspective of someone who spends far more time reading code (and patching it or packaging it) than writing it, i worry about that mindset.




Every time one of these comes out, I hope it DOES take all my dev jobs so I can focus on things that don't require writing code. Every time, they fall really short.


I don't think it's possible in the strictest sense: beyond jr level none of us really 'generate code' as our main value add.

That said, I think ML advances may possibly usher in the next gen of low-code tools that may liberate a large portion of web devs from being human LLMs.


LLMs are not restricted to generating code either.

I think you are right w.r.t. frontend; it's significantly easier than regular programming, as you can (or will be able to) simply enumerate a list or tree of the various style choices, generate them all, and allow people to A/B/C/D/.../ZYX test between them.


I’ve been able to generate permutations for spreadsheets / curl pretty easily now. Thank goodness; I can never remember curl


I've met many so-called senior devs who shit out code like there's no tomorrow, pretending their value add is based on the usage of the framework du jour. No reflection about tradeoffs, tech debt, onboarding of new devs, documentation, testability, or even fitness to requirements. AI code generation is just gonna make them shit out more code, and their managers will be none the wiser.


Nightmarish, my condolences.


I sure hope so.

Honestly, we shouldn't need AI/ML for this. We have so many well-established need and use patterns that we should already have standardized "lego"-style functionality blocks, optionally with interchangeable wrappers to provide customization options (common automatic options or even programmable ones).

The reason we're still all reinventing wheels all the time is, frankly, because it's too easy to do our own thing. I would never advocate for this (and I'm not an MS guy), but perhaps if Microsoft had won the world and Linux had never existed, we would all be using the same MS libraries and possibly be further down the road than we are.


A parallel universe where the internet is built with no code Windows Forms. That’s not exactly heaven.


One might imagine that if all the creative talent that is currently spread across a dozen programming languages and dozens of frameworks were focused on a much narrower set of possibilities, we would arrive at better overall options sooner.

What I've observed over 20 years is the opposite - a continual expansion, even an explosion of competing (often uncompelling) alternatives.


We will get there eventually. I sort of want LLMs to write code, but more so I want LLMs to help library authors and ecosystem maintainers write better and more cohesive libraries and ecosystems. What if LLMs wrote beautiful code, like the early Lispers dreamed of :-). And by beautiful, I mean an intern with no coding experience at work comes into a codebase and could be productive with hand writing code (like they did in the ol' days before washing machines) but they'll probably just fire up the LLM.

What I am imagining is the LLM crafting the code in such a way that you can look at a small set of certain files and get a real understanding of the story of that program or service. Some people write code like that, but most (including me) can't unless we spend a lot of time crafting, which is not a luxury we get at work.

Pedagogical code is probably the word I am looking for!

With that you could get an LLM to write a new cryptography library, and assuming you are a crypto expert, you could quickly verify that it is doing the right thing, as it makes it obvious.


alas all of the benchmarks are based around totally self-contained problems.


I don't see this creating new algorithms (as in, not in the training corpus), but maybe giving the kind of answer you would expect from Stack Overflow, without all the social fluff around it (comments, badges and so on).

The day one of these finds new algorithms that solve problems with better complexity or simpler code than the state of the art, I'll wake up. When I give an LLM a computational geometry problem, it's exactly like a student trying to bullshit his/her way through an exam without any actual deep understanding.

For example, I ask for an algorithm to compute Laguerre Voronoi diagrams (usually not available in books or code examples), and I get answers for plain Voronoi diagrams, because it's what you will find in many books and code samples. Generating boring but necessary code, in moderation, is a win.


A model for algorithms would not be trained on code but on mathematics with some limited pseudocode.

IIRC there are some people looking at theorem proving with LLMs, but the reality is they don't have to do anything foundational or groundbreaking to be of value in assisting, supplanting, or replacing the vast majority of people who interact with computer code.

We are talking about the field that openly just memorized leetcode questions to "make it", right?

Why should we have such high standards for tools when we don't even apply them to the "best and brightest"?


Dude you could be Harry Kim on the holodeck creating whatever you can think of. What's wrong with that?


We have to go through the Eugenics Wars first. I am NOT wearing that awful quilted whatever-it-was tunic the soldiers apparently wore.


But how, if AI bros keep stealing IP to build their little models?


Take the laws that enrich them away from them.

They’re just people. Not divine mandates. There is zero real obligation to serve contemporary socio-political and economic norms.

Not really finding any of this progress in computing shocking though. Unix and that model wastes a lot of resources dealing with strings.

What a shock we reduce the amount of fluff to just enough symbolic logic to instigate appropriate electron state to solve a problem, rather than brag about a new DSL to tokenize, parse and logic to process bespoke template syntax in yet more asinine “file formats” …once you remove the chimps chasing Shakespeare, superfluous state doing nothing but propping up venture backed coder boot camp grads jobs we find super powerful software.

Shocking, I’m shocked.


Huh?


maybe he is trying to make it sound like treknobabble but contemporary to the current conversation on ai?


I mean, it does fit the trend, but I am confused by the author of the comment defending IP theft by AI. Heists never end well, particularly when done by those perceived to have overwhelming power (i.e. OpenAI has billions of $ to spend, while the individuals whose content is stolen have significantly less).


“Heist”

Huh?

Disagree with the use of that term.

If anything, copyright is the heist. Disney subsists on the back-breaking labor of others while it wags around papers saying it owns things that make it rich enough to avoid real work.


?


Cola wars?


whack-a-mole with coder wars


This looks like a llama2 finetune, so the dataset (inclusive of llama2) isn't fully open as claimed, and I'd still have to accept the Facebook and possibly OpenAI licenses.

Let alone that clearly the base model was built on non-source-code, so their premise doesn't hold.

Disappointing.


The number of entities that are possibly constrained by the llama 2 license can be counted on two hands and all of them have the ability to train models that can match Llama 2's performance.


But it's a transformative finetune. Why shouldn't the licenses of the sources apply to the original LLM, but apply to the finetune, for which the LLM is just another source?



