Extending Context Length in Large Language Models



How to turn your Llama into a Giraffe

Donato Riccio
Towards Data Science
Image by the author. (AI generated Llamas)

Context length is the maximum number of tokens a model can take into account at once when generating text. A longer context window lets the model capture long-range dependencies in the text. Models with longer contexts can connect ideas that sit far apart, generating more globally coherent outputs.

During training, the model processes the text data in chunks or fixed-length windows. Models need to be trained on lengthy texts to actually leverage long contexts. Training sequences must contain documents, books, articles, etc., with thousands of tokens.
The length of the training sequences sets a practical limit on the usable context length.

So, why don’t we train models on longer sequences?

Not so fast.

Increasing context length increases the number of possible token combinations the model must learn to predict accurately.
This enables more robust long-range modeling but also requires more memory and processing power, leading to higher training costs.

Without any optimization, the cost of attention scales quadratically with context length, meaning that a 4096-token model needs 64 times (8 squared) more attention computation than a 512-token model.
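To make the arithmetic concrete, here is a minimal sketch of that ratio. It assumes cost grows exactly with the square of the context length and ignores every other part of the model; the function name is made up for illustration.

```python
def attention_cost_ratio(long_ctx: int, short_ctx: int) -> float:
    """Relative cost of full attention, assuming cost grows as n ** 2."""
    return (long_ctx / short_ctx) ** 2


# 4096 / 512 = 8, and 8 ** 2 = 64
print(attention_cost_ratio(4096, 512))  # 64.0
```

Under the same assumption, doubling the context length quadruples the attention cost.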

You can use sparse or approximate attention methods to reduce the computation cost, but they may also affect the model’s accuracy.
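As one illustration of a sparse method, consider sliding-window (local) attention, where each token only looks at a fixed number of the most recent tokens. This is just one possible pattern, chosen here as an example and not necessarily the one any particular model uses. The sketch below counts how many query-key pairs are scored with and without the window, assuming causal attention (a token never looks ahead); the function names are hypothetical.

```python
def full_causal_pairs(n: int) -> int:
    """Query-key pairs scored when token i attends to tokens 0..i."""
    return n * (n + 1) // 2


def sliding_window_pairs(n: int, window: int) -> int:
    """Pairs scored when token i attends to at most `window` recent tokens
    (itself included). Assumed pattern, for illustration only."""
    return sum(min(i + 1, window) for i in range(n))


n, window = 4096, 512
print(full_causal_pairs(n), sliding_window_pairs(n, window))
```

With a fixed window the number of pairs grows roughly linearly with sequence length instead of quadratically. The price is that tokens outside the window are no longer directly visible, which is where the accuracy trade-off comes from.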

Training and using large-context language models present three main challenges:

  • Fitting long contexts into the model.
  • Accelerating inference and training so they don’t take forever.
  • Ensuring high-quality inference that maintains awareness of the full context.

The attention mechanism is the core component of transformer models. It relates different positions of a sequence to compute its representation, allowing models to focus on relevant parts of the text and understand it better. Scaling transformers to longer sequences is hard because full attention compares every token with every other token, so its cost grows quadratically with sequence length.
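To see where the quadratic cost comes from, here is a minimal pure-Python sketch of scaled dot-product attention for a single head. It leaves out batching, masking, and learned projections, and the function names are illustrative rather than taken from any library. The point is that the score matrix has one row and one column per token, so it has n × n entries.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    # One score per (query, key) pair: this is the n x n matrix.
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        for q in queries
    ]
    weights = [softmax(row) for row in scores]
    # Each output is a weighted average of the value vectors.
    dim = len(values[0])
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(dim)]
        for row in weights
    ]
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 3-token sequence
outputs, weights = attention(tokens, tokens, tokens)
print(len(weights), "x", len(weights[0]), "attention weights")
```

For three tokens the weight matrix is 3 × 3; for a 4096-token context the same matrix would have more than 16 million entries per head.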