I don’t see this in the article. Has Anthropic explained the mechanism by which they were able to cost-effectively expand the context window, and whether additional training or a design decision (e.g. an alternative positional embedding approach) helped the model handle a larger window?
Interesting. Does the decision to use ALiBi have to be made before the model weights are first trained, or could these models have incorporated ALiBi, instead of or in addition to an alternative positional encoding method, after they were first trained?
The decision needs to be made before training starts. Maybe there is a clever way to add it after the fact, in the style of LoRA? First, that would be a different method in its own right (just as LoRA is); second, I can't see how to do it easily. But then again, I only thought about it for a minute.
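To see why it's hard to bolt on afterwards: ALiBi is not a learned module you can fine-tune in, it's a fixed bias added to the attention logits, and the pretrained weights have already adapted to whatever positional scheme was used. Here's a minimal sketch of the bias itself (my own illustration of the published formula, not any particular model's code):

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    # Head-specific slopes from the ALiBi paper: for n_heads a power of
    # two, a geometric sequence starting at 2^(-8/n_heads).
    start = 2.0 ** (-8.0 / n_heads)
    return np.array([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    # Static bias added to attention logits before softmax:
    # bias[h, i, j] = -slope[h] * (i - j), so each head penalizes
    # attention to distant past tokens linearly (future positions are
    # handled by the usual causal mask). Nothing here is learned, which
    # is why it is normally baked in before pre-training begins.
    pos = np.arange(seq_len)
    rel = pos[None, :] - pos[:, None]              # rel[i, j] = j - i
    slopes = alibi_slopes(n_heads)
    return slopes[:, None, None] * rel.astype(float)  # (heads, q, k)

# Usage: logits = q @ k.T / sqrt(d) + alibi_bias(n_heads, seq_len)[h],
# then softmax as usual.
bias = alibi_bias(8, 16)
```

Since the bias has no trainable parameters, "adding ALiBi later" really means retraining the attention layers under a different logit landscape, which is the part I don't see a cheap shortcut for.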