
Learning Notes

Self-Extend LLM Context Window

The "Self-Extend" method is designed to tackle the positional out-of-distribution (OOD) issue in LLMs (with relative position encodings such as RoPE). This issue arises when LLMs encounter text sequences during inference that exceed the length of their pretraining context window. In such cases, LLMs are exposed to new relative distances not present during their pretraining phase, leading to unpredictable behaviors. (Thats because, with unseen relative positions, the attention distributions are very different compared to those within the pretraining context window length)

The core idea of "Self-Extend" is to remap these unseen large relative positions to those encountered during pretraining. This is achieved using a simple floor division operation (//) as the mapping function.

The "Self-Extend" method innovatively employs a dual attention mechanism to extend the context window of LLMs. This mechanism consists of two types of attention: grouped attention and normal attention.

  1. The grouped attention handles tokens that are far apart. For these distant tokens, the FLOOR operation is applied to their relative positions (a sketch of how the two attentions fit together follows this list).
  2. The normal attention mechanism remains unchanged from the pretraining stage and is used for tokens in close proximity, essentially neighbors. This ensures that the model retains its original, finely tuned ability to process and understand the immediate context around a given token.
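
Below is a minimal NumPy sketch of how the two kinds of positions can be combined. In the actual method the grouped and normal attentions are computed separately and then merged; this sketch only illustrates the position values, and `neighbor_window`, `group_size`, and the boundary shift are illustrative assumptions rather than the paper's exact notation.

```python
import numpy as np

def self_extend_rel_pos(seq_len: int, neighbor_window: int, group_size: int) -> np.ndarray:
    # Within `neighbor_window`, keep the original relative distances (normal
    # attention); beyond it, compress distances with floor division (grouped
    # attention). The shift term is one simple way to keep the two regions
    # contiguous at the boundary.
    q_idx = np.arange(seq_len)[:, None]
    k_idx = np.arange(seq_len)[None, :]
    dist = q_idx - k_idx                      # causal relative distance (>= 0 below the diagonal)

    grouped = dist // group_size + (neighbor_window - neighbor_window // group_size)
    merged = np.where(dist <= neighbor_window, dist, grouped)
    return np.tril(merged)                    # keep only the causal part

print(self_extend_rel_pos(seq_len=8, neighbor_window=3, group_size=2))
```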

(Figure: Self-Extend's grouped and normal attention)

Note that Self-Extend only modifies the attention mechanism during inference and it does not require any fine-tuning or training.

More about it in the paper: LLM Maybe LongLM


Continuous Batching

In the unoptimized method, the language model processes one sequence at a time, adding a single token (word or punctuation) in each step. For example, given the input "Mark is quick. He moves", the model adds one word at a time until the sequence ends. This method is slow, especially for longer texts.

The batched generation method improves efficiency by processing multiple sequences simultaneously. Sequences are padded with filler tokens to make them equal in length, and these tokens are masked so they don't influence the generation. For instance, the model can simultaneously process multiple sentences in one pass, adding a word to each sequence.
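
A small sketch of the padding-and-masking step, with made-up token ids and a hypothetical `PAD` id:

```python
import numpy as np

PAD = 0
seqs = [[5, 7, 9], [3, 4], [8, 2, 6, 1]]              # token ids of unequal length

max_len = max(len(s) for s in seqs)
# Left-pad so all sequences line up; pad positions get attention mask 0.
batch = np.array([[PAD] * (max_len - len(s)) + s for s in seqs])
mask  = (batch != PAD).astype(int)

print(batch)
print(mask)   # 0s mark filler tokens that must not influence generation
```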

Continuous batching further optimizes this process. When a sequence completes, new sequences are inserted into the batch to replace the completed ones. This approach prevents wasting computational resources on generating unnecessary tokens after a sequence ends. It's more efficient because it leverages the model's capacity to handle multiple sequences in parallel, thus reducing overall processing time and resource usage.
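
The scheduling loop below sketches this idea; `Request`, `step_batch`, and the constants are hypothetical stand-ins rather than any particular serving framework's API.

```python
from collections import deque

# Hypothetical stand-ins: a request carries its generated tokens, and
# step_batch() appends one new token per active sequence.
class Request:
    def __init__(self, prompt_tokens):
        self.tokens = list(prompt_tokens)
        self.done = False

def step_batch(batch):
    """Placeholder for one forward pass: generate one token per request."""
    for req in batch:
        req.tokens.append(0)                # dummy token
        req.done = req.tokens[-1] == EOS or len(req.tokens) >= MAX_LEN

EOS, MAX_LEN, BATCH_SIZE = -1, 32, 4
waiting = deque(Request([1, 2, 3]) for _ in range(10))
active = []

# Continuous batching: whenever a sequence finishes, a waiting request
# immediately takes its slot instead of the whole batch idling.
while waiting or active:
    while waiting and len(active) < BATCH_SIZE:
        active.append(waiting.popleft())    # refill freed slots
    step_batch(active)                      # one decode step for all active sequences
    active = [r for r in active if not r.done]
```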

(Figure: continuous batching)

More about it in the blog: Continuous batching in LLM inference


MoEs: Mixture of Experts

MoE models use sparse layers of specialized sub-networks (experts) and a gate network (router) that determines which tokens are sent to which expert. Each expert specializes in different aspects of the data. In practice, the experts are FFNs, but they can also be more complex networks or even MoEs themselves.

Although a MoE might have many parameters, only some of them are used during inference. This leads to much faster inference compared to a dense model with the same number of parameters.
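
A toy NumPy sketch of top-k routing through a few expert FFNs; the sizes, names, and plain-NumPy setup are illustrative assumptions, not any particular MoE implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k, n_tokens = 16, 4, 2, 8

# Each "expert" is a tiny FFN; the router is a single linear layer.
experts = [(rng.standard_normal((d_model, 4 * d_model)),
            rng.standard_normal((4 * d_model, d_model))) for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts))

def moe_layer(x):
    logits = x @ router_w                               # (tokens, experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]       # indices of top-k experts per token
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)  # softmax over chosen experts

    out = np.zeros_like(x)
    for t in range(x.shape[0]):                         # only the selected experts run per token
        for k, e in enumerate(top[t]):
            w1, w2 = experts[e]
            out[t] += gates[t, k] * (np.maximum(x[t] @ w1, 0) @ w2)
    return out

x = rng.standard_normal((n_tokens, d_model))
print(moe_layer(x).shape)   # (8, 16)
```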

(Figure: Mixture-of-Experts layer)

More about it in the blog: Mixture of Experts Explained


Model Merging: A Look into DARE

DARE introduces a simple yet impactful way to create a hybrid model effortlessly, by combining the weights of two models that were fine-tuned from the same base model on different tasks or data (e.g., one on math, one on code).

DARE first identifies the so-called delta parameters by taking the difference between the fine-tuned weights and the base-model weights; these deltas tend to be very small (on the order of 0.005).

In Four Steps (sketched in code after the list):

  • Randomly pick a large fraction (e.g., 90%) of those delta parameters.
  • Drop them by setting those deltas to zero, which reverts the affected weights to the base model's values.
  • Rescale the remaining deltas by 1/(1 − p), where p is the drop rate.
  • Finally, merge the updated weights of two or more fine-tuned models using any existing model-merging technique.
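
A minimal NumPy sketch of the drop-and-rescale step for a single weight tensor, followed by a simple averaging merge; the function name, drop rate, and averaging choice are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def dare_delta(base_w, finetuned_w, drop_rate=0.9, rng=np.random.default_rng(0)):
    """Drop a random fraction of the delta parameters and rescale the rest
    by 1 / (1 - p), for one weight tensor (illustrative sketch)."""
    delta = finetuned_w - base_w                      # delta parameters
    keep_mask = rng.random(delta.shape) >= drop_rate  # drop ~90% at p = 0.9
    delta = np.where(keep_mask, delta / (1.0 - drop_rate), 0.0)
    return base_w + delta                             # weights to feed into merging

# Merge two task-specific models (e.g., math and code) fine-tuned from the
# same base, here with simple averaging as the final merging step.
base = np.zeros((4, 4))
math_w, code_w = base + 0.004, base - 0.003           # tiny deltas, as DARE observes
merged = 0.5 * (dare_delta(base, math_w) + dare_delta(base, code_w))
print(merged)
```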

(Figure: DARE drop-and-rescale merging)

More about it in the paper: Language Models are Super Mario


Last update: March 21, 2024