
what are model weights?


A large array of uniquely-set floating point values. (AKA "parameters".)

In a language model, a word is put in one end (as a numerical index into a wordlist), it and the weights are multiplied together, and a new word comes out the other end (again as an index).

Numbers in, numbers out, and a small bit of logic that maps words to numbers and back at either end. ("Encodings".)
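A minimal sketch of that pipeline in Python (the toy vocabulary and predict_next() are made up for illustration; real LLMs use subword tokenizers and billions of weights, not whole words and a one-liner):

    # Toy sketch: word -> number -> number -> word.
    vocab = ["the", "cat", "sat", "on", "mat"]
    word_to_id = {w: i for i, w in enumerate(vocab)}  # "encoding"
    id_to_word = dict(enumerate(vocab))               # "decoding"

    def predict_next(token_id):
        # Stand-in for the real model: multiply the input through
        # the weight matrices and pick the highest-scoring output id.
        return (token_id + 1) % len(vocab)  # fake logic, for shape only

    token = word_to_id["cat"]         # word in, as an index
    next_token = predict_next(token)  # numbers in, numbers out
    print(id_to_word[next_token])     # word out, again via an index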

"Training" is the typically expensive process of feeding huge amounts of data into the model, to get it to choose the magic values for its weights that allow it to do useful stuff that looks and feels like that training data.

Something else that can be done with weights is that they can be "fine-tuned": "tweaked" slightly to give different overall results out of the model, tailoring it to some new use-case. Often the model gets a new name afterwards.

In this case, what's been released is not actually the weights. It's a set of these tweaks ("deltas"), which are intended to be added to Meta's LLaMA model weights to end up with the final intended LLaMA-based model, called "Vicuna".
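Applying the deltas is conceptually just elementwise addition, something like this (the filenames here are hypothetical; the actual Vicuna release ships a script that does this tensor by tensor):

    import numpy as np

    # Illustrative only: pretend each file holds one weight tensor.
    base = np.load("llama_13b_layer0.npy")      # Meta's LLaMA weights (hypothetical filename)
    delta = np.load("vicuna_delta_layer0.npy")  # the released "tweaks" (hypothetical filename)

    vicuna = base + delta  # elementwise add recovers the fine-tuned weights
    np.save("vicuna_13b_layer0.npy", vicuna)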


> A large array of uniquely-set floating point values.

How large? How many elements?


It's in the name of the model - "Vicuna-13B" implies there are 13 billion parameters.
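And since each parameter is stored as a 16-bit (2-byte) float, that also tells you the storage size: 13,000,000,000 × 2 bytes ≈ 26 GB just for the weights.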


the way these LLMs work, there is a weight for each parameter? 13 billion weights? what is an example of a parameter?


A parameter is a variable for which a weight (a floating point value) is the concrete value.


a weight is an example of a parameter

so is a bias, and presumably the biases are also in the same file with the weights


Essentially a computer neural network is just a lot of addition (and matrix multiplication) of floating point numbers. The parameters are the "strength" or "weights" of the connections between neurons on different layers, and the "bias" of each neuron. If neuron Alice is connected to neuron Bob, Alice has a value of 0.7, and the weight of Alice's connection to Bob is 0.5, then the value sent from Alice to Bob is 0.35. This value (and the values from all the other incoming connections) is summed and added to Bob's bias.
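In code, a whole layer of those Alice-to-Bob updates is one matrix multiply plus a bias vector. A minimal numpy sketch (the numbers are the ones from the example above, the bias is picked arbitrarily, and this isn't any particular model's architecture):

    import numpy as np

    # One layer: every weight and bias below is a "parameter".
    x = np.array([0.7])    # Alice's activation
    W = np.array([[0.5]])  # weight of the Alice -> Bob connection
    b = np.array([-0.1])   # Bob's bias (value chosen arbitrarily)

    z = W @ x + b            # 0.5 * 0.7 + (-0.1) = 0.25
    bob = np.maximum(z, 0)   # a common activation function (ReLU)
    print(bob)               # [0.25]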

I highly recommend checking out 3blue1brown's series on how neural nets, gradient descent, and the dot product (implemented as matrix multiplication) all tie together: https://www.youtube.com/watch?v=aircAruvnKk


To add to this excellent reply, I'll also point out that the reason folks want the weights is that they are the result of a massive search operation, akin to finding the right temperature to bake a cake from all possible floats. It takes a lot of wall-clock time, a lot of GPU energy, and a lot of input examples and counter-examples to find the "right" numbers. Thus, it really is better -- all things being equal -- to publish the results of that search so everyone else doesn't have to repeat it for themselves.
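A toy version of that search, fitting two parameters by gradient descent (the data and learning rate are made up, and this is nothing like the real scale, but it's the same principle):

    # Toy "search": find weight w and bias b so that y ≈ w*x + b,
    # by repeatedly nudging them downhill on the error.
    data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # made-up (x, y) pairs; true answer is w=2, b=1
    w, b = 0.0, 0.0
    lr = 0.01  # learning rate

    for step in range(5000):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x  # gradient of squared error w.r.t. w
            b -= lr * err      # gradient of squared error w.r.t. b

    print(w, b)  # converges near 2.0 and 1.0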


> a massive search operation, akin to finding the right temperature to bake a cake from all possible floats

...for each of 13 billion (for a model with that many parameters) different cakes, except that they aren't like cakes because the "best" temperature for each depends on the actual temperatures chosen for the others.


It's 2^(16*13,000,000,000) different cakes.


Way better than paperclips.


Why would a 4-bit quantized model be less accurate than a 16-bit one?


My layperson's understanding is that it's due to the problem one is trying to solve with a deep learning model: drawing a curve through the dimensions which separates "good" from "bad" activation values. The lower the resolution of the line, the higher the likelihood that it will fit in some places and veer off into erroneous space in others.

Imagine trying to draw the blue line on the right using only Lego blocks: https://youtu.be/QDX-1M5Nj7s?t=1202

discussion: https://news.ycombinator.com/item?id=35405338


Because 4 bits specifies the value of a parameter less precisely than 16 bits does.
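You can see the effect with a naive linear quantizer (a sketch only; the weights below are made up, and real schemes like the ones in llama.cpp quantize in blocks with per-block scales):

    import numpy as np

    # Naive 4-bit quantization of some fp16 weights: only 16 distinct
    # levels survive, so each weight gets rounded to the nearest level.
    w = np.array([0.127, -0.034, 0.501, -0.662], dtype=np.float16)

    scale = np.abs(w).max() / 7               # map weights onto integers -8..7
    q = np.clip(np.round(w / scale), -8, 7)   # the stored 4-bit codes
    w_hat = (q * scale).astype(np.float16)    # dequantized values

    print(w_hat)      # close to w, but not equal
    print(w - w_hat)  # the rounding error that 4 bits introduces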


They basically encapsulate what a model has "learned." ML models without their weights are useless because the output is essentially random noise. You then train the model on data, and it changes the weights into numbers that cause the whole thing to work. Training data and processing power are usually very expensive so the resulting weights are valuable.


They are the parameters of this large language model. There are 13B fp16 numbers.


the secret sauce of AI


lol weights are all you need



