
I don’t see this in the article. Has Anthropic explained the mechanism by which they were able to cost-effectively expand the context window, and whether additional training or a design decision (e.g. an alternative positional-embedding approach) helped the model handle a larger window?



No. As far as I know, they haven't said anything about this. Neither did OpenAI about gpt-4-32k.

MosaicML did say something about MPT-7B-StoryWriter-65k+: https://www.mosaicml.com/blog/mpt-7b. They are using ALiBi (Attention with Linear Biases): https://arxiv.org/abs/2108.12409.
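
Roughly, ALiBi drops learned positional embeddings entirely and instead adds a head-specific linear distance penalty to the raw attention scores, which is what lets a model extrapolate to sequences longer than it was trained on. A minimal sketch of the bias computation from the paper (assuming a causal model and a power-of-two head count, which the paper's geometric slope formula covers directly; this is not anyone's production implementation):

    import torch

    def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
        # Head-specific slopes, geometric as in the paper: 2^(-8/n), 2^(-16/n), ...
        slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
        # Relative position j - i: 0 on the diagonal, negative for past keys.
        pos = torch.arange(seq_len)
        distance = pos[None, :] - pos[:, None]
        # Bias of shape (n_heads, seq_len, seq_len); add it to the attention
        # scores before softmax, then apply the usual causal mask.
        return slopes[:, None, None] * distance[None]

    # Usage: scores = q @ k.transpose(-1, -2) / d_head**0.5 + alibi_bias(n_heads, T)

Because the penalty depends only on relative distance, nothing in the model is tied to the training sequence length — which is the property people are speculating Anthropic exploited.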

I think OpenAI and Anthropic are using ALiBi or their own proprietary advances. Both seem possible.


Interesting. Does the decision to use ALiBi have to be made before the model weights are first trained, or could these models have incorporated ALiBi (instead of, or in addition to, another positional encoding method) after the initial training?


The decision needs to be made before training starts. Maybe there is a clever way to add it after the fact, in the style of LoRA? But first, that would be a different method in its own right (just as LoRA is), and second, I can't see how to do it easily. Then again, I've only thought about it for a minute.
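
For reference, "in the style of LoRA" here means freezing the pretrained weights and learning only a small low-rank delta on top of a projection matrix, roughly like the sketch below (illustrative names, not any particular library's API). The difficulty is that ALiBi isn't a weight delta at all: it changes how the attention scores themselves are computed, so there's no obvious matrix to attach a cheap low-rank correction to.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen base projection plus a trainable low-rank delta (B @ A)."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # original weights stay frozen
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # delta starts at zero
            self.scale = alpha / r

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # W x + (alpha/r) * B A x; only A and B receive gradients
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)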


A lot of people are speculating online (https://twitter.com/search?q=anthropicai%20alibi&src=typed_q...), but I'm guessing it's ALiBi, which MPT-7B-StoryWriter also used to reach its 65k+ context (extrapolating to roughly 84k tokens at inference).


No, they are playing this close to the chest, just as OpenAI did with how it achieved its 32k context limit.



